diff --git a/.gitignore b/.gitignore index ec1fc82..de66697 100644 --- a/.gitignore +++ b/.gitignore @@ -5,3 +5,4 @@ lib/cgiLog.txt thang/* tmp/* archive/* +doc/.htaccess diff --git a/CHANGELOG.txt b/CHANGELOG.txt index b8643ad..df0aabc 100644 --- a/CHANGELOG.txt +++ b/CHANGELOG.txt @@ -1,3 +1,6 @@ +100901 (done by Thang) +- Incorporate BiblioScrip (http://github.com/mromanello/BiblioScript) and BibUtils (http://www.scripps.edu/~cdputnam/software/bibutils/) + 100401e (done by Min on 100725) - Minor changes to paths and to make it work again from wing.nus directory (moved from forecite, due to restructuring of WING server) diff --git a/README.TXT b/README.TXT index d9e7c58..68103ea 100644 --- a/README.TXT +++ b/README.TXT @@ -59,6 +59,7 @@ bin/ - Binaries / scripts for running ParsCit /sectLabel/redo.sectLabel.pl - Cross Validation training script and training notes (at end of script) for SectLabel + /BiblioScript - Include BiblioScript and BibUtils codes # Thang v100901 CHANGELOG.TXT - Changes between versions of the code. crfpp/ - The CRF++ machine learning package used within ParsCit /traindata - Sample training data for the CRF++ code of ParsCit, diff --git a/bin/#W00-0102.body# b/bin/#W00-0102.body# deleted file mode 100644 index f3e87c5..0000000 --- a/bin/#W00-0102.body# +++ /dev/null @@ -1,4094 +0,0 @@ -Using -Long -Runs -as -Predictors -of -Semantic -Coherence -in -a -Partial -Document -Retrieval -System -Hyopil -Shin -Computing -Research -Laboratory, -NMSU -PO -Box -30001 -Las -Cruces, -NM, -88003 -hshin@crl.nmsu.edu -Jerrold -F. -Stach -Computer -Science -Telecommunications, -UMKC -5100 -Rockhill -Road -Kansas -City, -MO, -64110 -stach@cstp.umkc.edu -Abstract -We -propose -a -method -for -dealing -with -semantic -complexities -occurring -in -information -retrieval -systems -on -the -basis -of -linguistic -observations. 
-Our -method -follows -from -an -analysis -indicating -that -long -runs -of -content -words -appear -in -a -stopped -document -cluster, -and -our -observation -that -these -long -runs -predominately -originate -from -the -prepositional -phrase -and -subject -complement -positions -and -as -such, -may -be -useful -predictors -of -semantic -coherence. -From -this -linguistic -basis, -we -test -three -statistical -hypotheses -over -a -small -collection -of -documents -from -different -genre. -By -coordinating -thesaurus -semantic -categories -(SEMCATs) -of -the -long -run -words -to -the -semantic -categories -of -paragraphs, -we -conclude -that -for -paragraphs -containing -both -long -runs -and -short -runs, -the -SEMCAT -weight -of -long -runs -of -content -words -is -a -strong -predictor -of -the -semantic -coherence -of -the -paragraph. -Introduction -One -of -the -fundamental -deficiencies -of -current -information -retrieval -methods -is -that -the -words -searchers -use -to -construct -terms -often -are -not -the -same -as -those -by -which -the -searched -information -has -been -indexed. -There -are -two -components -to -this -problem, -synonymy -and -polysemy -(Deerwester -et. -al., -1990). -By -definition -of -polysemy, -a -document -containing -the -search -terms -or -indexed -with -the -search -terms -is -not -necessarily -relevant. -Polysemy -contributes -heavily -to -poor -precision. -Attempts -to -deal -with -the -synonymy -problem -have -relied -on -intellectual -or -automatic -term -expansion, -or -the -construction -of -a -thesaurus. -Also -the -ambiguity -of -natural -language -causes -semantic -complexities -that -result -in -poor -precision. -Since -queries -are -mostly -formulated -as -words -or -phrases -in -a -language, -and -the -expressions -of -a -language -are -ambiguous -in -many -cases, -the -system -must -have -ways -to -disambiguate -the -query. 
-In -order -to -resolve -semantic -complexities -in -information -retrieval -systems, -we -designed -a -method -to -incorporate -semantic -information -into -current -IR -systems. -Our -method -( -1 -) -adopts -widely -used -Semantic -Information -or -Categories, -(2) -calculates -Semantic -Weight -based -on -probability, -and -(3) -(for -the -purpose -of -verifying -the -method) -performs -partial -text -retrieval -based -upon -Semantic -Weight -or -Coherence -to -overcome -cognitive -overload -of -the -human -agent. -We -make -two -basic -assumptions: -1. -Matching -search -terms -to -semantic -categories -should -improve -retrieval -precision. -2. -Long -runs -of -content -words -have -a -linguistic -basis -for -Semantic -Weight -and -can -also -be -verified -statistically. -1 -A -Brief -Overview -of -Previous -Approaches -There -have -been -several -attempts -to -deal -with -complexity -using -semantic -information. -These -methods -are -hampered -by -the -lack -of -dictionaries -containing -proper -semantic -categories -for -classifying -text. -Semantic -methods -designed -by -Boyd -et. -al. -(1994) -and -Wendlandt -et. -al. -(1991) -demonstrate -only -simple -examples -and -are -restricted -to -small -numbers -of -words. -In -order -to -overcome -this -6 -deficiency, -we -propose -to -incorporate -the -structural -information -of -the -thesaurus, -semantic -categories -(SEMCATs). -However, -we -must -also -incorporate -semantic -categories -into -current -IR -systems -in -a -compatible -manner. -The -problem -we -deal -with -is -partial -text -retrieval -when -all -the -terms -of -the -traditional -vector -equations -are -not -known. -This -is -the -case -when -retrieval -is -associated -with -a -near -real -time -filter, -or -when -the -size -or -number -of -documents -in -a -corpus -is -unknown. -In -such -cases -we -can -retrieve -only -partial -text, -a -paragraph -or -page. 
-But -since -there -is -no -document -wide -or -corpus -wide -statistics, -it -is -difficult -to -judge -whether -or -not -the -text -fragment -is -relevant. -The -method -we -employ -in -this -paper -identifies -semantic -"hot -spots" -in -partial -text. -These -"hot -spots" -are -loci -of -semantic -coherence -in -a -paragraph -of -text. -Such -paragraphs -are -likely -to -convey -the -central -ideas -of -the -document, -We -also -deal -with -the -computational -aspects -of -partial -text -retrieval. -We -use -a -simple -stop/stem -method -to -expose -long -runs -of -context -words -that -are -evaluated -relative -to -the -search -terms. -Our -goal -is -not -to -retrieve -a -highly -relevant -sentence, -but -rather -to -retrieve -a -portion -of -text -that -is -semantically -coherent -with -respect -to -the -search -terms. -This -locale -can -be -returned -to -the -searcher -for -evaluation -and -if -it -is -relevant, -the -search -terms -can -be -refined. -This -approach -is -compatible -with -Latent -Semantic -Indexing -(LSI) -for -partial -text -retrieval -when -the -terms -of -the -vector -space -are -not -known. -LSI -is -based -on -a -vector -space -information -retrieval -method -that -has -demonstrated -improved -performance -over -the -traditional -vector -space -techniques. -So -when -incorporating -semantic -information, -it -is -necessary -to -adopt -existing -mathematical -methods -including -probabilistic -methods -and -statistical -methods. -2 -Theoretical -Background -2.1 -Long -Runs -Partial -Information -Retrieval -has -to -with -detection -of -main -ideas. -Main -ideas -are -topic -sentences -that -have -central -meaning -to -the -text. -Our -method -of -detecting -main -idea -paragraphs -extends -from -Jang -(1997) -who -observed -that -after -stemming -and -stopping -a -document, -long -runs -of -content -words -cluster. -Content -word -runs -are -a -sequence -of -content -words -with -a -function -word(s) -prefix -and -suffix. 
-These -runs -can -be -weighted -for -density -in -a -stopped -document -and -vector -processed. -We -observed -that -these -long -content -word -runs -generally -originate -from -the -prepositional -phrase -and -subject -complement -positions, -providing -a -linguistic -basis -for -a -dense -neighbourhood -of -long -runs -of -content -words -signalling -a -semantic -locus -of -the -writing. -We -suppose -that -these -neighbourhoods -may -contain -main -ideas -of -the -text. -In -order -to -verify -this, -we -designed -a -methodology -to -incorporate -semantic -features -into -information -retrieval -and -examined -long -runs -of -content -words -as -a -semantic -predictor. -We -examined -all -the -long -runs -of -the -Jang -(1997) -collection -and -discovered -most -of -them -originate -from -the -prepositional -phrase -and -subject -complement -positions. -According -to -Halliday -(1985), -a -preposition -is -explained -as -a -minor -verb. -It -functions -as -a -minor -Predicator -having -a -nominal -group -as -its -complement. -Thus -the -internal -structure -of -'across -the -lake' -is -like -that -of -'crossing -the -lake', -with -a -non-finite -verb -as -Predicator -(thus -our -choice -of -3 -words -as -a -long -run). -When -we -interpret -the -preposition -as -a -"minor -Predicator" -and -"minor -Process", -we -are -interpreting -the -prepositional -phrase -as -a -kind -of -minor -clause. -That -is, -prepositional -phrases -function -as -a -clause -and -their -role -is -predication. -Traditionally, -predication -is -what -a -statement -says -about -its -subject. -A -named -predication -corresponds -to -an -externally -defined -function, -namely -what -the -speaker -intends -to -say -his -or -her -subject, -i.e. -their -referent. 
-If -long -runs -largely -appear -in -predication -positions, -it -would -suggest -that -the -speaker -is -saying -something -important -and -the -longer -runs -of -content -words -would -signal -a -locus -of -the -speaker's -intention. -Extending -from -the -statistical -analysis -of -Jang -(1997) -and -our -observations -of -those -long -runs -in -the -collection, -we -give -a -basic -assumption -of -OUT -study: -Long -runs -of -content -words -contain -significant -semantic -information -that -a -speaker -wants -to -express -and -focus, -and -thus -are -semantic -indicators -or -loci -or -main -ideas. -7 -In -this -paper, -we -examine -the -SEMCAT -values -of -long -and -short -runs, -extracted -from -a -random -document -of -the -collection -in -Jang -(1997), -to -determine -if -the -SEMCAT -weights -of -long -runs -of -content -words -are -semantic -predictors. -2.2 -SEMCATs -We -adopted -Roget's -Thesaurus -for -our -basic -semantic -categories -(SEMCATs). -We -extracted -the -semantic -categories -from -the -online -Thesaurus -for -convenience. -We -employ -the -39 -intermediate -categories -as -basic -semantic -information, -since -the -6 -main -categories -are -too -general, -and -the -many -sub-categories -are -too -narrow -to -be -taken -into -account. -We -refer -to -these -39 -categories -as -SEMCATs. 
-Table -1: -Semantic -Categories -(SEMCATs) - -Abbreviation -Full -Description -1 -AFIG -Affection -in -General -2 -ANT -Antagonism -3 -CAU -Causation -4 -CHN -Change -5 -COIV -Conditional -Intersocial -Volition -6 -CRTH -Creative -Thought -7 -DIM -Dimensions - -EXIS -Existence -9 -EXOT -Extension -of -Thought -1° -FORM -Form -11 -GINV -General -Inter -social -Volition -12 -INOM -Inorganic -Matter -13 -MECO -Means -of -Communication -14 -MFRE -Materials -for -Reasoning -15 -MIG -Matter -ingeneral -16 -MOAF -Moral -Affections -17 -MOCO -Modes -of -Communication -18 -MOT -Motion -19 -NOIC -Nature -of -Ideas -Communicated -20 -NUM -Number -21 -opm -Operations -of -Intelligence - -In -General -22 -ORD -Order -23 -ORGM -Organic -Matter -24 -pEAF -Personal -Affections -25 -PORE -Possessive -Relations -26 -PRCO -Precursory -Conditions -and -Operations -27 -PRVO -Prospective -Volition -28 -QUAN -Quantity -29 -REAF -Religious -Affections -ao -RELN -Relation -31 -REOR -Reasoning -Organization -32 -REPR -Reasoning -Process -33 -ROVO -Result -of -Voluntary -Action -34 -SIG -Space -in -General -35 -S -IVO -Special -Inter -social -Volition -36 -SYAF -Sympathetic -Affections -37 -TIME -Time -38 -VOAC -Voluntary -Action -39 -VOIG -Volition -in -General -2.3 -Indexing -Space -and -Stop -Lists -Many -of -the -most -frequently -occurring -words -in -English, -such -as -"the," -"of," -"and," -"to," -etc. -are -non-discriminators -with -respect -to -information -filtering. -Since -many -of -these -function -words -make -up -a -large -fraction -of -the -text -of -Most -documents, -their -early -elimination -in -the -indexing -process -speeds -processing, -saves -significant -amounts -of -index -space -and -does -not -compromise -the -filtering -process. -In -the -Brown -Corpus, -the -frequency -of -stop -words -is -551,057 -out -of -1,013,644 -total -words. -Function -words -therefore -account -for -about -54.5% -of -the -tokens -in -a -document. 
-The -Brown -Corpus -is -useful -in -text -retrieval -because -it -is -small -and -efficiently -exposes -content -word -runs. -Furthermore, -minimizing -the -document -token -size -is -very -important -in -NLP- -based -methods, -because -NLP-based -methods -usually -need -much -larger -indexing -spaces -than -statistical-based -methods -due -to -processes -for -tagging -and -parsing. -3 -Experimental -Basis -In -order -to -verify -that -long -runs -contribute -to -resolve -semantic -complexities -and -can -be -used -as -predictors -of -semantic -intent, -we -employed -a -probabilistic, -vector -processing -methodology. -3.1 -Revised -Probability -and -Vector -Processing -In -order -to -understand -the -calculation -of -SEMCATs, -it -is -helpful -to -look -at -the -structure -8 -of -a -preprocessed -document. -One -document -"Barbie" -in -the -Jang -(1997) -collection -has -a -total -of -1,468 -words -comprised -of -755 -content -words -and -713 -function -words. -The -document -has -17 -paragraphs. -Filtering -out -function -words -using -the -Brown -Corpus -exposed -the -runs -of -content -words -as -shown -in -Figure -1. -Figure -1: -Preprocessed -Text -Document -BARBIE -* -* -* -* -FAVORITE -COMPANION -DETRACTORS -LOVE -* -* -* -PLASTIC -PERFECTION -* -FASHION -DOLL -* -* -IMPOSSIBLE -FIGURE -* -LONG -* -* -* -POPULAR -GIRL -* -MA -ITEL -* -WORLD -* -TOYMAKER -* -PRODUCTS -RANGE -* -FISHER -PRICE -INFANT -* -SALES -* -* -* -TALL -MANNEQUIN -* -BARBIE -* -* -AGE -* -* -* -BEST -SELLING -GIRLS -BRAND -* -* -POISED -* -STRUT -* -* -CHANGE -* -* -MALE -DOMINATED -WORLD -* -MULTIMEDIA -SOFTWARE -* -VIDEO -GAMES -In -Figure -1, -asterisks -occupy -positions -where -function -words -were -filtered -out. -The -bold -type -indicates -the -location -of -the -longest -runs -of -content -words. 
-The -run -length -distribution -of -Figure -1 -is -shown -below: -Table -2: -Distribution -of -Content -Run -Lengths -in -a -sam -le -Document -Run -Length -Frequency -1 -II -2 -8 -3 -2 -4 -2 -The -traditional -vector -processing -model -requires -the -following -set -of -terms: -œôó˘ -(dl) -the -number -of -documents -in -the -collection -that -each -word -occurs -in -œôó˘ -(id° -the -inverse -document -frequency -of -each -word -determined -by -logio(N/df) -where -N -is -the -total -number -of -documents. -If -a -word -appears -in -a -query -but -not -in -a -document, -its -idf -is -undefined. -œôó˘ -The -category -probability -of -each -query -word. -Wendlandt -(1991) -points -out -that -it -is -useful -to -retrieve -a -set -of -documents -based -upon -key -words -only, -and -then -considers -only -those -documents -for -semantic -category -and -attribute -analysis. -Wendlandt -(1991) -appends -the -s -category -weights -to -the -t -term -weights -of -each -document -vector -Di -and -the -Query -vector -Q. -Since -our -basic -query -unit -is -a -paragraph, -document -frequency -(dl) -and -inverse -document -frequency -(idf) -have -to -be -redefined. -As -we -pointed -out -in -Section -1, -all -terms -are -not -known -in -partial -text -retrieval. -Further, -our -approach -is -based -on -semantic -weight -rather -than -word -frequency. -Therefore -any -frequency -based -measures -defined -by -Boyd -et -al. -(1994) -and -Wendlandt -(1991) -need -to -be -built -from -the -probabilities -of -individual -semantic -categories. -Those -modifications -are -described -below. -As -a -simplifying -assumption, -we -assume -SEMCATs -have -a -uniform -probability -distribution -with -regard -to -a -word. -3.2 -Calculating -SEMCATs -Our -first -task -in -computing -SEMCAT -values -was -to -create -a -SEMCAT -dictionary -for -our -method. -We -extracted -SEMCATs -for -every -word -from -the -World -Wide -Web -version -of -Roget's -thesaurus. 
-SEMCATs -give -probabilities -of -a -word -corresponding -to -a -semantic -category. -The -content -word -run -'favorite -companion -detractors -love' -is -of -length -4. -Each -word -of -the -run -maps -to -at -least -one -SEMCAT. -The -word -`favorite' -maps -to -categories -`PEAF -and -SYAF'. -'companion' -maps -to -categories -'ANT, -MECO, -NUM, -ORD, -ORGM, -PEAF, -PRVO, -QUAN, -and -SYAF'. -'detractor' -maps -to -`MOAF'. -'love' -maps -to -`AFIG, -ANT, -MECO, -MOAF, -MOCO, -ORGM, -PEAF, -PORE, -PRVO, -SYAF, -and -VOIG'. -We -treat -the -long -runs -as -a -semantic -core -from -which -to -calculate -SEMCAT -values. -SEMCAT -weights -are -calculated -based -on -the -following -equations. -Eq.1 -Pik(Probability) -- -The -likelihood -of -SEMCAT -Si -occurring -due -to -the -le -trigger. -For -example, -assuming -a -uniform -probability -distribution, -the -category -PEAF -triggered -by -the -word -favorite -above, -has -the -following -probability: -PPEAF, -favorite -= -0.5(112) -Eq.2 -Sw; -(SEMCAT -Weights -in -Long -runs) -is -the -sum -of -each -SEMCATO -weight -of -long -runs -based -on -their -probabilities. -In -the -above -example, -the -long -run -9 -'favorite -companion -detractors -love,' -the -SEMCAT -`MOAF' -has -SWMOAF -(detractor(1) -love(.09)) -= -1.09. -We -can -write; -SWi -= -I -p,, -Eq.3 -edwj -(Expected -data -weights -in -a -paragraph) -- -Given -a -set -of -N -content -words -(data) -in -a -paragraph, -the -expected -weight -of -the -SEMCATs -of -long -runs -in -a -paragraph -is: -edwj -= -pi; -,=1 -Eq.4 -idwj -(Inverse -data -weights -in -a -paragraph) -- -The -inverse -data -weight -of -SEMCATs -of -long -runs -for -a -set -of -N -content -words -in -a -paragraph -is -N -), -ichvi=logio((- -edwi -Eq.5 -Weight(W) -- -The -weight -of -SEMCAT -Si -in -a -paragraph -is -W; -= -Swjxidw; -Eq.6 -Relevance -Weights -(Semantic -Coherence) -Our -method -performs -the -following -steps: -1. 
-calculate -the -SEMCAT -weight -of -each -long -content -word -run -in -every -paragraph -(Sw) -2. -calculate -the -expected -data -weight -of -each -paragraph -(edw) -3. -calculate -the -inverse -expected -data -weight -of -each -paragraph -(idw) -4. -calculate -the -actual -weight -of -each -paragraph -(Swxidw) -5. -calculate -coherence -weights -(total -relevance) -by -summing -the -weights -of -(Swxidw). -In -every -paragraph, -extraction -of -SEMCATs -from -long -runs -is -done -first. -The -next -step -is -finding -the -same -SEMCATs -of -long -runs -through -every -word -in -a -paragraph -(expected -data -weight), -then -calculate -idw, -and -finally -Swxidw. -The -final, -total -relevance -weights -are -an -accumulation -of -all -weights -of -SEMCATs -of -content -words -in -a -paragraph. -Total -relevance -tells -how -many -SEMCATs -of -the -Query's -long -runs -appear -in -a -paragraph. -Higher -values -imply -that -the -paragraph -is -relevant -to -the -long -runs -of -the -Query. 
-The -following -is -a -program -output -for -calculating -SEMCAT -weights -for -an -arbitrary -long -run: -"SEVEN -INTERACTIVE -PRODUCTS -LED" -SEMCAT: -EXOT -Sw -: -1.00 -edw -: -1.99 -idw -: -1.44 -Swxidw -: -1.44 -SEMCAT: -GINV -Sw -: -0.33 -edw -: -1.62 -idw -: -1.53 -Swxidw -: -0.51 -SEMCAT: -MOT -Sw -: -0.20 -edw -: -0.71 -idw -: -1.89 -Swxidw -: -0.38 -SEMCAT: -NUM -Sw -: -0.20 -edw -: -1.76 -idw -: -1.49 -Swxidw -: -0.30 -SEMCAT: -ORGM -Sw -: -0.20 -edw -: -1.67 -idw -1.52 -Swxidw -; -0,30 -SEMCAT: -PEAF -Sw -: -0.53 -edw -: -1.50 -idw -: -1.56 -Swxidw -: -0.83 -SEMCAT: -REAF -Sw -: -0.20 -edw -: -0.20 -idw -: -2.44 -Swxidw -: -0.49 -SEMCAT: -SYAF -Sw -: -0.33 -edw -: -1.19 -idw -: -1.66 -Swxidw -: -0.55 -Total -(Swxidw) -: -4,79 -4 -Experimental -Results -The -goal -of -employing -probability -and -vector -processing -is -to -prove -the -linguistic -basis -that -long -runs -of -content -words -can -be -used -as -predictors -of -semantic -intent -But -we -also -want -to -exploit -the -computational -advantage -of -removing -the -function -words -from -the -document, -which -reduces -the -number -of -tokens -processed -by -about -50% -and -thus -reduces -vector -space -and -probability -computations. -If -it -is -true -that -long -runs -of -content -words -are -predictors -of -semantic -coherence, -we -can -further -reduce -the -complexity -of -vector -computations: -(1) -by -eliminating -those -paragraphs -without -long -runs -from -consideration, -(2) -within -remaining -paragraphs -with -long -runs, -computing -and -summing -the -semantic -coherence -of -the -longest -runs -only, -(3) -ranking -the -eligible -paragraphs -for -retrieval -based -upon -their -semantic -weights -relative -to -the -query. -Jang -(1997) -established -that -the -distribution -of -long -runs -of -content -words -and -short -runs -of -content -words -in -a -collection -of -paragraphs -are -drawn -from -different -populations. 
-This -implies -10 -that -either -long -runs -or -short -runs -are -predictors, -but -since -all -paragraphs -contain -short -runs, -i.e. -a -single -content -word -separated -by -function -words, -only -long -runs -can -be -useful -predictors. -Furthermore, -only -long -runs -as -we -define -them -can -be -used -as -predictors -because -short -runs -are -insufficient -to -construct -the -language -constructs -for -prepositional -phrase -and -subject -complement -positions. -If -short -runs -were -discriminators, -the -linguistic -assumption -of -this -research -would -be -violated. -The -statistical -analysis -of -Jang -(1997) -does -not -indicate -this -to -be -the -case. -To -proceed -in -establishing -the -viability -of -our -approach, -we -proposed -the -following -experimental -hypotheses: -(111) -The -SEMCAT -weights -for -long -runs -of -content -words -are -statistically -greater -than -weights -for -short -runs -of -content -words. -Since -each -content -word -can -map -to -multiple -SEMCATs, -we -cannot -assume -that -the -semantic -weight -of -a -long -run -is -a -function -of -its -length. -The -semantic -coherence -of -long -runs -should -be -a -more -granular -discriminator. -(112) -For -paragraphs -containing -long -runs -and -short -runs, -the -distribution -of -long -run -SEMCAT -weights -is -statistically -different -from -the -distribution -of -short -run -SEMCAT -weights. -(H3) -There -is -a -positive -correlation -between -the -sum -of -long -run -SEMCAT -weights -and -the -semantic -coherence -of -a -paragraph, -the -total -paragraph -SEMCAT -weight. -A -detailed -description -of -these -experiments -and -their -outcome -are -described -in -Shin -(1997, -1999). -The -results -of -the -experiments -and -the -implications -of -those -results -relative -to -the -method -we -propose -are -discussed -below. 
-Table -3 -gives -the -SEMCAT -weights -for -seventeen -paragraphs -randomly -chosen -from -one -document -in -the -collection -of -Jang -(1997). -Table -3: -SEMCAT -Weights -of -17 -Paragraphs -Chosen -Randomly -From -a -Collection -Paragraph -Short -Runs -Long -Runs - -Weight -Weight -1 -29.84 -18.60 -2 -31.29 -12.81 -3 -23.29 -4.25 -4 -23.94 -11.63 -5 -34.63 -35.00 -6 -22.85 -03.32 -7 -21.74 -00.00 -8 -35.84 -15.94 -9 -30.15 -00.00 -10 -13.40 -00.00 -11 -23.01 -07.82 -12 -31.69 -04.79 -13 -36.54 -00.00 -14 -17.91 -10.55 -15 -19.70 -05.83 -16 -17.11 -00.00 -17 -31.86 -00.00 -The -data -was -evaluated -using -a -standard -two -way -F -test -and -analysis -of -variance -table -with -ot -= -.05. -The -analysis -of -variance -table -for -the -paragraphs -in -Table -3 -is -shown -in -Table -4. -Table -4: -Analysis -of -Variance -for -Table -2 -Data -Variation -Degrees -of -Mean -Square -F - -Freedom -Between -1 -2904.51 -68.56 -Treatments -V, -= -2904.51 -Between -Blocks -16 -93.92 -2.21 -yr -= -1502.83 -Residual -or -16 -42.36 -Random -V,= -677.77 -Total -33 -V -= -5085.11 -At -the -.05 -significance -level, -Fa -05 -= -4.49 -for -1,16 -degrees -of -freedom. -Since -68.56 -> -4.49 -we -reject -the -assertion -that -column -means -(run -weights) -are -equal -in -Table -2. -Long -run -and -short -run -weights -come -from -different -populations. -We -accept -Hl. -For -the -between -paragraph -treatment, -the -row -means -(paragraph -weights) -have -an -F -value -of -2.21. -At -the -.05 -significance -level, -Fa -. -05 -= -2.28 -for -16,16 -degrees -of -freedom. -Since -2.21 -< -2.28 -we -cannot -reject -the -assertion -that -there -is -no -significant -difference -in -SEMCAT -weights -between -paragraphs. -That -is, -paragraph -weights -do -not -appear -to -be -taken -from -different -populations, -as -do -the -long -run -and -short -run -weight -distributions. 
-Thus, -the -semantic -weight -11 -of -the -content -words -in -a -paragraph -cannot -be -used -to -predict -the -semantic -weight -of -the -paragraph. -We -therefore -proceed -to -examine -H2. -Notice -that -two -paragraphs -in -Table -2 -are -without -long -runs. -We -need -to -repeat -the -analysis -of -variance -for -only -those -paragraphs -with -long -runs -to -see -if -long -runs -are -discriminators. -Table -5 -summarizes -those -paragraphs. -Table -5: -SEMCAT -weights -of -11 -paragraphs -containing -Ion -runs -and -short -runs -Paragraph -Short -Runs -Long -Runs - -Weight -Weight -1 -29.84 -18.60 -2 -31.29 -12.81 -3 -23.29 -4.25 -4 -23.94 -11,63 -5 -34.63 -35.00 -6 -22.85 -03.32 -8 -35.84 -15.94 -11 -23.01 -07.82 -12 -31.69 -04.79 -14 -17.91 -10.55 -15 -19.70 -05.83 -This -data -was -evaluated -using -a -standard -two -way -F -test -and -analysis -of -variance -with -a -= -.05. -The -analysis -of -variance -table -for -the -paragraphs -in -Table -5 -follows. -Table -6: -Analysis -of -Variance -for -Table -5 -Data -Variation -._ -Mean -Square -F - -Degrees - -of -Freedom -Between -Treatments -1 -1430.98 -291.44 -V= -1430.98 -Between -Blocks -10 -94.40 -19.22 -V= -944.05 -Residual -or -10 -4.91 -Random -V,...- -49.19 -Total -21 -V -= -2424.26 -At -the -.05 -significance -level, -F. -.05 -= -4.10 -for -2,10 -degrees -of -freedom. -4.10 -< -291.44. -At -the -.05 -significance -level, -F. -= -2.98 -for -10,10 -degrees -of -freedom. -2.98 -< -19.22. -For -paragraphs -in -a -collection -containing -both -long -and -short -runs: -the -SEMCAT -weights -of -the -long -runs -and -short -runs -are -drawn -from -different -distributions. -We -accept -112. -For -paragraphs -containing -long -runs -and -short -runs, -the -distributions -of -long -run -SEMCAT -weights -is -different -from -the -distribution -of -short -run -SEMCAT -weights. -We -know -from -the -linguistic -basis -for -long -runs -that -short -runs -cannot -be -used -as -predictors. 
-We -therefore -proceed -to -examine -the -Pearson -correlation -between -the -long -run -SEMCAT -weights -and -paragraph -SEMCAT -weights -for -those -paragraphs -with -both -long -and -short -content -word -runs. -Table -7: -Correlation -of -Long -Run -SEMCAT -Wei -hts -to -Para -ra -h -SEMCAT -Weight -Paragraph -Long -Runs -Semantic -Weight -Paragraph -Semantic -Weight -1 -18.60 -48.44 -2 -12.81 -44.10 -3 -4.25 -27.54 -4 -11.63 -35.57 -5 -35.00 -69.63 -6 -03.32 -26.17 -8 -15.94 -51.78 -11 -07.82 -30.83 -12 --04.79 -31.69 -14 -10.55 -28.46 -15 -05.83 -25.53 -The -weights -in -Table -have -a -positive -Pearson -Product -Correlation -coefficient -of -.952. -We -therefore -accept -1-13. -There -is -a -positive -correlation -between -the -sum -of -long -run -SEMCAT -weights -and -the -semantic -coherence -of -a -paragraph, -the -total -paragraph -SEMCAT -weight. -5. -Conclusion -This -research -tested -three -statistical -hypotheses -extending -from -two -observations: -(1) -fang -(1997) -observed -the -clustering -of -long -runs -of -content -words -and -established -the -distribution -of -long -run -lengths -and -short -run -lengths -are -drawn -from -different -populations, -(2) -our -observation -that -these -long -runs -of -content -words -originate -from -the -prepositional -phrase -and -subject -complement -positions. -According -to -Halliday -(1985) -those -grammar -structures -function -as -12 -minor -predication -and -as -such -are -loci -of -semantic -intent -or -coherence. -In -order -to -facilitate -the -use -of -long -runs -as -predictors, -we -modified -the -traditional -measures -of -Boyd -et -al. -(1994), -Wendlandt -(1991) -to -accommodate -semantic -categories -and -partial -text -retrieval. -The -revised -metrics -and -the -computational -method -we -propose -were -used -in -the -statistical -experiments -presented -above. -The -main -findings -of -this -work -are -1. 
-the -distribution -semantic -coherence -(SEMCAT -weights) -of -long -runs -is -not -statistically -greater -than -that -of -short -runs, -2. -for -paragraphs -containing -both -long -runs -and -short -runs, -the -SEMCAT -weight -distributions -are -drawn -from -different -populations -3. -there -is -a -positive -correlation -between -the -sum -of -long -run -SEMCAT -weights -and -the -total -SEMCAT -weight -of -the -paragraph -(its -semantic -coherence). -Significant -additional -work -is -required -to -validate -these -preliminary -results. -The -collection -employed -in -Jang -(1997) -is -not -a -standard -Corpus -so -we -have -no -way -to -test -precision -and -relevance -of -the -proposed -method. -The -results -of -the -proposed -method -are -subject -to -the -accuracy -of -the -stop -lists -and -filtering -function. -Nonetheless, -we -feel -the -approach -proposed -has -potential -to -improve -performance -through -reduced -token -processing -and -increased -relevance -through -consideration -of -semantic -qcoherence -of -long -runs. -Significantly, -our -approach -does -not -require -knowledge -of -the -collection. -References diff --git a/bin/2010-ASEO--preprint.body b/bin/2010-ASEO--preprint.body deleted file mode 100644 index 371390e..0000000 --- a/bin/2010-ASEO--preprint.body +++ /dev/null @@ -1,477 +0,0 @@ -Preprint of: Jöran Beel, Bela Gipp, and Erik Wilde. Academic Search Engine Optimization (ASEO): Optimizing Scholarly Literature for Google Scholar and -Co. Journal of Scholarly Publishing, 41 (2): 176–190, January 2010. doi: 10.3138/jsp.41.2.176. University of Toronto Press. Downloaded from -http://www.sciplore.org -Academic Search Engine Optimization (ASEO): Optimizing -Scholarly Literature for Google Scholar & Co.
-Döran Beel -Otto-von-Guericke University -FIN / ITI / VLBA-Lab -Germany -beel@sciplore.org -Bela Gipp -Otto-von-Guericke University -FIN / ITI / VLBA-Lab -Germany -gipp@sciplore.org -Erik Wilde -UC Berkeley -School of Information -United States -dret@berkeley.edu -ABSTRACT -This article introduces and discusses the concept of academic -search engine optimization (ASEO). Based on three recently -conducted studies, guidelines are provided on how to optimize -scholarly literature for academic search engines in general and -for Google Scholar in particular. In addition, we briefly discuss -the risk of researchers’ illegitimately ‘over-optimizing’ their -articles. -Keywords -academic search engines, academic search engine optimization, -ASEO, Google Scholar, ranking algorithm, search engine -optimization, SEO -1. INTRODUCTION -Researchers should have an interest in ensuring that their articles -are indexed by academic search engines1 such as Google Scholar, -IEEE Xplore, PubMed, and SciPlore.org, which greatly improves -their ability to make their articles available to the academic -community. Not only should authors take an interest in seeing -that their articles are indexed, they also should be interesting in -where the articles are displayed in the results list. Like any other -type of ranked search results, articles displayed in top positions -are more likely to be read. -This article presents the concept of academic search engine -optimization (ASEO) to optimize scholarly literature for -academic search engines. The first part of the article covers -related work that has been done mostly in the field of general -search engine optimization for Web pages. The second part -defines ASEO and compares it to search engine optimization for -Web pages. The third part provides an overview of ranking -algorithms of academic search engines in general, followed by an -overview of Google Scholar’s ranking algorithm.
Finally, -guidelines are provided on how authors can optimize their -articles for academic search engines. This article does not cover -how publishers or providers of academic repositories can -optimize their Web sites and repositories for academic search -engines. The guidelines are based on three studies we have -recently conducted [1-3] and on our experience in developing the -academic search engine SciPlore.org. -1 In this article we do not distinguish between ‘academic -databases’ and ‘academic search engines’; the latter term is -used as synonym for both. -2. RELATED WORK -On the Web, search engine optimization (SEO) for Web sites is a -common procedure. SEO involves creating or modifying a Web -site in a way that makes it ‘easier for search engines to both -crawl and index [its] content’ [4]. There exists a huge community -that discusses the latest trends in SEO and provides advice for -Webmasters in forums, blogs, and newsgroups.2 Even research -articles and books exist on the subject of SEO [5-10]. When SEO -began, many expressed their concerns that it would promote -spam and tweaking, and, indeed, search-engine spam is a serious -issue [11-26]. Today, however, SEO is a common and widely -accepted procedure and overall, search engines manage to -identify spam quite well. Probably the strongest argument for -SEO is the fact that search engines themselves publish guidelines -on how to optimize Web sites for search engines [4, 27]. 
But -similar information on optimizing scholarly literature for -academic search engines does not exist, to our knowledge.3 -2.1 Introduction to Academic Search Engine -Optimization (ASEO) -Based on the definition of search engine optimization for Web -pages (SEO), we define academic search engine optimization -(ASEO) as follows: -Academic search engine optimization (ASEO) is the creation, -publication, and modification of scholarly literature in a -way that makes it easier for academic search engines to both -crawl it and index it. -ASEO differs from SEO in four significant respects. First, for -Web search, Google is the market leader in most (Western) -countries [28]. This means that for Webmasters (focusing on -Western Internet users), it is generally sufficient to optimize their -Web sites for Google. In contrast, no such market leader exists -2 E.g. http://www.abakus-internet-marketing.de/foren -http://www.highrankings.com/forum -http://www.seo-guy.com/forum -http://www.seomoz.org/blog -http://www.seo.com/blog -http://www.abakus-internet-marketing.de/seoblog -3 Google Scholar offers some information for publishers on how -to get their articles indexed by Google Scholar and ranked well -[35]. However, this information is superficial in comparison to -other SEO articles, and the information is not aimed at authors. -for searching academic articles, and researchers would need to -optimize their articles for several academic search engines. If -these search engines are based on different crawling and ranking -methods, optimization can become complicated. -Second, Webmasters usually do not need to worry about whether -their site is indexed by a search engine: as long as any Web page -is linked to an already indexed page, it will be crawled and -indexed by Web search engines at some point. 
The situation is -different in academia, where only a fraction of all published -material is available on the Web and accessible to Web-based -academic search engines such as CiteSeer. Most academic -articles are stored in publishers’ databases; they are part of the -‘academic invisible web,’ [29] and (academic) search engines -usually cannot access and index these articles. A few academic -search engines, such as Scirus and Google Scholar, cooperate -with publishers, but still they do not cover all existing articles -[30-32]. Researchers therefore need to think seriously about how -to get their articles indexed by academic search engines. -Third, Webmasters can alter their pages by adding or replacing -words and links, deleting pages, offering multiple versions with -slight variations, and so on; in this way they can test new -methods and adapt to changes in ranking algorithms. Scholarly -authors can hardly do so: once an article is published, it is -difficult and sometimes impossible to alter it. Therefore, ASEO -needs to be performed particularly carefully. -Finally, Web search engines usually index all text on a Web site, -or at least the majority of it. In contrast, some academic search -engines do not index a document’s full text but instead index -only the title and abstract. This means that for some academic -search engines authors need to focus on the article’s title and -abstract, but in other cases they still have to consider the full text -for other search engines. -2.2 An Overview of Academic Search -Engine Ranking Algorithms -The basic concept of keyword-based searching is the same for all -major (academic) search engines. Users search for a search term -in a certain document field (e.g., title, abstract, body text), or in -all fields, and all documents containing the search term are listed -on the results page. Academic search engines use different -ranking algorithms to determine in which position the results are -displayed. 
Some let the user choose one factor on which to rank -the results (common ranking factors are publication date, citation -count, author or journal name and reputation, and relevance of -the document); others combine the ranking factors into one -algorithm, and, more often than not, the user has no influence on -the factor’s weighting. -The relevance of a document is basically a function of how often -the search term occurs in that document and in which part of the -document it occurs. Generally speaking, the more often a search -term occurs in the document, and the more important the -document field is in which the term occurs, the more relevant the -document is considered4. This means that an occurrence in the -4 Some algorithms, such as the BM25(f ), saturate when a word -occurs often in the text [36]. -title is weighted more heavily than an occurrence in the abstract, -which carries more weight than an occurrence in a (sub)heading, -than in the body text, and so on. Possible document fields that -may be weighted differently by academic search engines are:5 -• Title -• Author names -• Abstract -• (Sub)headings -• Author keywords -• Body text -• Tables and figures -• Publication name (name of journal, conference, -proceedings, book, etc.) -• User keywords (Social tags) -• Social annotations -• Description -• Filename -• URI -The metadata of electronic files are especially important for -academic search engines crawling the Web. When a search -engine finds a PDF on the Web, it does not know whether this -PDF represents an academic article, or which one it belongs to; -therefore, the PDF must be identified, and one way to do this is -by extracting the author and title. This can be done by analyzing -the full text of the document or the metadata of the PDF. -It is also important to note that text in figures and tables usually -is indexed only if it is embedded as real text or within a vector -graphic. 
If text is embedded as a raster graphic (e.g., *.bmp, -*.png, *.gif, *.tif, *.jpg), most, if not all, search engines will not -index the text (see Figures 1 and 2 for an illustration of -differences between vector and raster/bitmap graphics).6 To our -knowledge, none of the major academic search engines currently -considers synonyms. This means that a document containing only -the term ‘academic search engine’ would not be found via a -search for ‘scientific paper search engine’ or ‘academic -database.’ What most academic search engines do is stemming: -words are reduced to their stems (e.g., ‘analysed’ and ‘analysing’ -would be reduced to ‘analyse’). -2.3 Google Scholar’s Ranking Algorithm -Google Scholar is one of those search engines that combine -several factors into one ranking algorithm. The most important -factors are relevance, citation count, author name(s), and name of -publication.7 -5 Some of the data could be retrieved from the document full -text, other from the metadata (of electronic files) -6 Theoretically search engines could index the text in -raster/bitmap graphics, but they would have to apply optical -character recognition (OCR). To our knowledge, no search -engine currently does this, although some are using OCR to -index complete scans of scholarly literature. -7 Google Scholar offers different search functions. For instance, it -is possible to search for ‘related articles’ and ‘recent articles.’ -In this article we focus on the normal ranking algorithm, which -is applied for the standard keyword search. -2.3.1 Relevance -Google Scholar focuses strongly on document titles. Documents -containing the search term in the title are likely to be positioned -near the top of the results list. 
Google Scholar also seems to -consider the length of a title: In a search for the term ‘SEO,’ a -document titled ‘SEO: An Overview’ would be ranked higher -than one titled ‘Search Engine Optimization (SEO): A Literature -Survey of the Current State of the Art.’ -Although Google Scholar indexes entire documents, the total -search term count in the document has little or no impact. In a -search for ‘recommender systems,’ a document containing fifty -instances of this term would not necessarily be ranked higher -than a document containing only ten instances. -Figure 1: Example of a Vector Graphic -Like other search engines, Google Scholar does not index text in -figures and tables inserted as raster/bitmap graphics, but it does -index text in vector graphics. It is also known that neither -synonyms nor PDF metadata are considered. -2.3.2 Citation Counts -Citation counts play a major role in Google Scholar’s ranking -algorithm, as illustrated in Figure 3, which shows the mean -citation count for each position in Google Scholar.8 It is clear -that, on average, articles in the top positions have significantly -more citations than articles in the lowest positions. This means -that to achieve a good ranking in Google Scholar, many citations -are essential. Google Scholar seems not to differentiate between -self-citations and citations by third parties. -8 On average, articles at position 1 had 834 citations, articles at -position 2 had 552, articles at position 3 had 426, and articles -at position 1000 had fifty-three. The study was based on -1,032,766 results produced by 1050 search queries in -November 2008. For more detail see [1]. -Figure 2: Example of a Bitmap Graphic -2.3.3 Author and Publication Name -If the search query includes an author or publication name, a -document in which either appears is likely to be ranked high. 
For -instance, seventy-four of the top 100 results of a search for -‘arteriosclerosis and thrombosis cure’ were articles about various -(medical) topics from the journal Arteriosclerosis, Thrombosis, -and Vascular Biology, many of which did not include the search -term either in the title or in the full text [2]. -Figure 3: Mean Citation Count per Position8 -2.3.4 Other factors -Google Scholar’s standard search does not consider publication -dates. However, Google Scholar offers a special search function -for ‘recent articles,’ which limits results to articles published -within the past five years. Furthermore, Google Scholar claims to -consider both publication and author reputation [33]. However, -we could not research the influence of these factors because of a -lack of data, and therefore we do not consider them here. -2.3.5 Sources Indexed by Google Scholar -Bert van Heerde, a professional in the field of SEO, uses the -term ‘invitation based search engine’ to describe Google Scholar: -Only articles from trusted sources and articles that are ‘invited’ -(cited) by articles already indexed are included in the database -[34]. ‘Trusted sources,’ in this case, are publishers that cooperate -directly with Google Scholar, as well as publishers and -Webmasters who have requested that Google Scholar crawl their -databases and Web sites.9 -Once an article is included in Google Scholar’s database, Google -Scholar searches the Web for corresponding PDF files, even if a -trusted publisher has already provided the full text. 10 It makes no -difference on which site the PDF is published; for instance, -Google Scholar has indexed PDF files of our articles from the -publisher’s site, our university’s site, our private home pages, -and SciPlore.org. PDFs found on the Web are linked directly on -Google Scholar’s results pages, in addition to the link to the -publisher’s full text (see Figure 4 for an illustrative example). 
-Figure 4: Linking database entries with external PDFs -If different PDF files of an article exist, Google Scholar groups -them to improve the article’s ranking [35]. For instance, if a -preprint version of an article is available on the author’s Web -page and the final version is available on the publisher’s site, -Google indexes both as one version. If the two versions contain -different words, Google Scholar associates all contained words -with the article. This is an interesting feature that we will -discuss in more detail in the next section. -3. OPTIMIZING SCHOLARLY -LITERATURE FOR GOOGLE SCHOLAR -AND OTHER ACADEMIC SEARCH -ENGINES -3.1 Preparation -In the beginning it is necessary to think about the most important -words that are relevant to the article. It is not possible to -optimize one document for dozens of keywords, so it is better to -choose a few. There are tools that help in selecting the right -keywords, such as Google Trends, Google Insights, Google -Adwords keyword tool, Google Search–based keyword tool, and -Spacky.11 -9 Visit http://www.google.com/support/scholar/bin/request.py to -ask Google Scholar to crawl your Web site containing scholarly -articles. -10 Google Scholar also indexes other file types, such as -PostScript (*.ps), Microsoft Word (*.doc), and MS PowerPoint -(*.ppt). Here we focus on PDF, which is the most common -format for scientific articles. -11 Google Trends http://www.google.com/trends -Google Insights http://www.google.com/insights/search/ -It might be wise not to select those keywords that are most -popular. It is usually a good idea to query the common academic -search engines using each proposed keyword; if the search -already returns hundreds of documents, it may be better to -choose another keyword with less competition. 
12 -3.2 Writing Your Article -Once the keywords are chosen, they need to be mentioned in the -right places: in the title, and as often as possible in the abstract -and the body of the text (but, of course, not so often as to annoy -readers). Although in general titles should be fairly short, we -suggest choosing a longer title if there are many relevant -keywords. -Synonyms of important keywords should also be mentioned a few -times in the body of the text, so that the article may be found by -someone who does not know the most common terminology used -in the research field. If possible, synonyms should also be -mentioned in the abstract, particularly because some academic -search engines do not index the document’s full text. -Be consistent in spelling people’s names, taking special care -with names that contain special characters. If names are used -inconsistently, search engines may not be able to identify articles -or citations correctly; as a consequence, citations may be -assigned incorrectly, and articles will not be as highly ranked as -they could be. For instance, Jöran, Joeran, and Joran are all -correct spellings of the same name (given different transcription -rules), but Google Scholar sees them as three different names. -The article should use a common scientific layout and structure, -including standard sections: introduction, related work, results, -and so on. A common scientific layout and structure will help -Web-based academic search engines to identify an article as -scientific. -Academic search engines, and especially Google Scholar, assign -significant weight to citation counts. Citations influence whether -articles are indexed at all, and they also influence the ranking of -articles. We do not want to encourage readers to build ‘citation -circles,’ or to take any other unethical action. But any published -articles you have read that relate to your current research paper -should be cited. 
When referencing your own published work, it is -important to include a link where that work can be downloaded. -This helps readers to find your article and helps academic search -engines to index the referenced article’s full text. Of course, this -can also be done for other articles that have well-known (i.e., -stable and possibly canonical) download locations. -3.3 Preparing for Publication -Text in figures and tables should be machine readable (i.e., -vector graphics containing font-based text should be used instead -Google Adwords -https://adwords.google.com/select/KeywordToolExternal; -Google keyword tool, http://google.com/sktool/ -Spacky, http://www.spacky.com -12 For example, keywords such as ‘Web’ and ‘HTML’ may be of -limited use because there are too many papers published in that -space, in which case it makes more sense to narrow the scope -and choose better-differentiated keywords. -of rasterized images) so that it can easily be indexed by academic -search engines. Vector graphics also look more professional, and -are more user friendly, than raster/bitmap graphics. Graphics -stored as JPEG, BMP, GIF, TIFF, or PNG files are not vector -graphics. -When documents are converted to PDF, all metadata should be -correct (especially author and title). Some search engines use -PDF metadata to identify the file or to display information about -the article on the search results page. It may also be beneficial to -give a meaningful file name to each article. -3.4 Publishing -As part of the optimization process, authors should consider the -journal’s or publisher’s policies. Open-access articles usually -receive more citations than articles accessible only by purchase -or subscription; and, obviously, only articles that are available on -the Web can be indexed by Web-based academic search engines. 
-Accordingly, when selecting a journal or publisher for -submission, authors should favor those that cooperate with -Google Scholar and other academic search engines, since the -article will potentially obtain more readers and receive more -citations. 13 If a journal does not publish online, authors should -favor publishers who at least allow authors to put their articles -on their or their institutions’ home pages. -3.5 Follow-Up -There are three ways to optimize articles for academic search -engines after publication. -The first is to publish the article on the author’s home page, so -that Web-based academic search engines can find and index it -even if the journal or publisher does not publish the article -online. An author who does not have a Web page might post -articles on an institutional Web page or upload it to a site such as -Sciplore.org, which offers researchers a personal publications -home page that is regularly crawled by Google Scholar (and, of -course, by SciPlore Search). However, it is important to -determine that posting or uploading the article does not -constitute a violation of the author’s agreement with the -publisher. -Second, an article that includes outdated words might be -replaced by either updating the existing article or publishing a -new version on the author’s home page. Google Scholar, at least, -considers all versions of an article available on the Web. We -consider this a good way of making older articles easier to find. -However, this practice may also violate your publisher’s -copyright policy, and it may also be considered misbehavior by -other researchers. It could also be a risky strategy: at some point -in the future, search engines may come to classify this practice as -spamming. In any case, updated articles should be clearly labeled -as such, so that readers are aware that they are reading a -modified version. -Third, it is important to create meaningful parent Web pages for -PDF files. 
This means that Web pages that link to the PDF file -should mention the most important keywords and the PDFs -13 The main criteria for selecting a publisher or journal, of -course, should still be its reputation and its general suitability -for the paper. The policy is to be seen as an additional factor. -metadata (title, author, and abstract). We do not know whether -any academic search engines are considering these data yet, but -normal search engines do consider them, and it seems only a -matter of time before academic search engines do, too. -4. DISCUSSION -As was true in the beginning for classic SEO, there are some -reservations about ASEO in the academic community. When we -submitted our study about Google Scholar’s ranking algorithm -[2] to a conference, it was rejected. One reviewer provided the -following feedback: -I’m not a big fan of this area of research [...]. I know it’s in -the call for papers, but I think that’s a mistake. -A second reviewer wrote, -[This] paper seems to encourage scientific paper authors to -learn Google scholar’s ranking method and write papers -accordingly to boost ranking [which is not] acceptable to -scientific communities which are supposed to advocate true -technical quality/impact instead of ranking. -ASEO should not be seen as a guide on how to cheat academic -search engines. Rather, it is about helping academic search -engines to understand the content of research papers and, thus, -about how to make this content more widely and easily available. -Certainly, we can anticipate that some researchers will try to -boost their rankings in illegitimate ways. However, the same -problem exists in regular Web searching; and eventually Web -search engines manage to avoid spam with considerable success, -and so will academic search engines. In the long term, ASEO -will be beneficial for all – authors, search engines, and users of -search engines. 
Therefore, we believe that academic search -engine optimization (ASEO) should be a common procedure for -researchers, similar to, for instance, selecting an appropriate -journal for publication. -ACKNOWLEDGEMENTS -We thank the SEO Bert van Heerde from Insyde -(http://www.insyde.nl/) for his valuable feedback, and Barbara -Shahin for proofreading this article. -ABOUT THE AUTHORS -The research career of Jöran Beel and Bela Gipp began about ten -years ago when they won second prize in Jugend Forscht, -Germany’s largest and most reputable youth science competition -and received awards from, among others, German Chancellor -Gerhard Schröder for their outstanding research work. In 2007, -they graduated with distinction at OVGU Magdeburg, Germany, -in the field of computer science. They now work for the VLBA- -Lab and are PhD students, currently at UC Berkeley as visiting -student researchers. During the past years they have published -several papers about academic search engines and research paper -recommender systems. -Erik Wilde is Adjunct Professor at the UC Berkeley School of -Information. He began his work in Web technologies and Web -architectures a little over ten years ago by publishing the first -book providing a complete overview of Web technologies. After -focusing for some years on XML technologies, XML and -modelling, mapping issues between XML and non-tree -metamodels, and XML-centric design of applications and data -models, he has recently shifted his main focus to information and -application architecture, mobile applications, geo-location issues -on the Web, and how to design data sharing that is open and -accessible for many different service consumers. -REFERENCES diff --git a/bin/2010-ASEO--preprint.cite b/bin/2010-ASEO--preprint.cite deleted file mode 100644 index 53b4ae0..0000000 --- a/bin/2010-ASEO--preprint.cite +++ /dev/null @@ -1,133 +0,0 @@ -[1] Jöran Beel and Bela Gipp. 
Google Scholar’s Ranking -Algorithm: The Impact of Citation Counts (An Empirical Study). -In André Flory and Martine Collard, editors, Proceedings of the -3rd IEEE International Conference on Research Challenges in -Information Science (RCIS’09), pages 439–446, Fez (Morocco), -April 2009. IEEE. doi: 10.1109/RCIS.2009.5089308. ISBN 978- -1-4244-2865-6. Available on http://www.sciplore.org. -[2] Jöran Beel and Bela Gipp. Google Scholar’s Ranking -Algorithm: An Introductory Overview. In Birger Larsen and -Jacqueline Leta, editors, Proceedings of the 12th International -Conference on Scientometrics and Informetrics (ISSI’09), -volume 1, pages 230–241, Rio de Janeiro (Brazil), July 2009. -International Society for Scientometrics and Informetrics. ISSN -2175-1935. Available on http://www.sciplore.org. -[3] Jöran Beel and Bela Gipp. Google Scholar’s Ranking -Algorithm: The Impact of Articles’ Age (An Empirical Study). In -Shahram Latifi, editor, Proceedings of the 6th International -Conference on Information Technology: New Generations -(ITNG’09), pages 160–164, Las Vegas (USA), April 2009. IEEE. -doi: 10.1109/ITNG.2009.317. ISBN 978-1424437702. Available -on http://www.sciplore.org. -[4] Google. Google’s Search Engine Optimization Starter Guide. -PDF, November 2008. URL http://www.google.com/- -webmasters/docs/search-engine-optimization-starter-guide.pdf. -[5] Albert Bifet and Carlos Castillo. An Analysis of Factors Used -in Search Engine Ranking. In Proceedings of the 14th -International World Wide Web Conference (WWW2005), First -International Workshop on Adversarial Information Retrieval on -the Web (AIRWeb’05), 2005. -http://airweb.cse.lehigh.edu/2005/bifet.pdf. -[6] Michael P. Evans. Analysing Google rankings through search -engine optimization data. Internet Research, 17 (1): 21–37, 2007. -doi: 10.1108/10662240710730470. -[7] Jin Zhang and Alexandra Dimitroff. 
The impact of metadata -implementation on webpage visibility in search engine results -(Part II). Cross-Language Information Retrieval, 41 (3): 691– -715, May 2005. -[8] Harold Davis. Search Engine Optimization. O’Reilly, 2006. -[9] Jennifer Grappone and Gradiva Couzin. Search Engine -Optimization: An Hour a Day. John Wiley and Sons, 2nd edition, -2008. -[10] Peter Kent. Search engine optimization for dummies. Wiley -Publishing Inc, 2006. -[11] AA Benczur, K Csalogány, T Sarlós, and M Uher. -SpamRank – Fully Automatic Link Spam Detection. In -Adversarial Information Retrieval on the Web (AIRWeb’05), -2005. -[12] A. Benczúr, K. Csalogány, and T. Sarlós. Link-based -similarity search to fight web spam. Adversarial Information -Retrieval on the Web (AIRWeb), Seattle, Washington, USA, -2006. -[13] I. Drost and T. Scheffer. Thwarting the nigritude -ultramarine: Learning to identify link spam. Lecture Notes in -Computer Science, 3720: 96, 2005. -[14] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, -and statistics: Using statistical analysis to locate spam web -pages. pages 1–6, 2004. -[15] Q. Gan and T. Suel. Improving web spam classifiers using -link structure. In Proceedings of the 3rd international workshop -on Adversarial information retrieval on the web, page 20. ACM, -2007. -[16] Z. Gyöngyi and H. Garcia-Molina. Link spam alliances. In -Proceedings of the 31st international conference on Very large -data bases, page 528. VLDB Endowment, 2005. -[17] H. Saito, M. Toyoda, M. Kitsuregawa, and K. Aihara. A -large-scale study of link spam detection by graph algorithms. In -Proceedings of the 3rd international workshop on Adversarial -information retrieval on the web, page 48. ACM, 2007. -[18] B. Wu and K. Chellapilla. Extracting link spam using biased -random walks from spam seed sets. In Proceedings of the 3rd -international workshop on Adversarial information retrieval on -the web, page 44. ACM, 2007. -[19] C. Castillo, D. Donato, A. Gionis, V. 
Murdock, and -F. Silvestri. Know your neighbors: Web spam detection using the -web topology. In Proceedings of the 30th annual international -ACM SIGIR conference on Research and development in -information retrieval, page 430. ACM, 2007. -[20] G.G. Geng, C.H. Wang, and Q.D. Li. Improving -Spamdexing Detection Via a Two-Stage Classification Strategy. -page 356, 2008. -[21] I.S. Nathenson. Internet infoglut and invisible ink: -Spamdexing search engines with meta tags. Harv. J. Law & Tec, -12: 43–683, 1998. -[22] T. Urvoy, E. Chauveau, P. Filoche, and T. Lavergne. -Tracking web spam with HTML style similarities. ACM -Transactions on the Web (TWEB), 2, 2008. -[23] T. Urvoy, T. Lavergne, and P. Filoche. Tracking web spam -with hidden style similarity. In AIRWeb 2006, page 25, 2006. -[24] Masahiro Kimura, Kazumi Saito, Kazuhiro Kazama, and -Shin ya Sato. Detecting Search Engine Spam from a Trackback -Network in Blogspace. Lecture Notes in Computer Science: -Knowledge-Based Intelligent Information and Engineering -Systems, 3684: 723–729, 2005. doi: 10.1007/11554028_101. -[25] Alexandros Ntoulas, Marc Najork, Mark Manasse, and -Dennis Fetterly. Detecting spam web pages through content -analysis. In 15th International Conference on World Wide Web, -pages 83–92. ACM, 2006. -[26] Baoning Wu and Brian D. Davison. Identifying link farm -spam pages. In 14th International Conference on World Wide -Web, pages 820–829, 2005. -[27] Yahoo! How do I improve the ranking of my web site in the -search results?, July 2007. URL http://help.yahoo.com/l/us/- -yahoo/search/indexing/ranking-02.html. -[28] Alex Chitu. Google’s Market Share in Your Country. -Website, March 2009. URL http://googlesystem.blogspot.com/- -2009/03/googles-market-share-in-your-country.html https://- -spreadsheets.google.com/- -ccc?key=pLaE9tsVLp_0y1 FKWBCKGBA. -[29] D. Lewandowski and P. Mayr. Exploring the academic -invisible web. Library Hi Tech, 24 (4): 529–539, 2006. 
-[30] Nisa Bakkalbasi, Kathleen Bauer, Janis Glover, and Lei -Wang. Three options for citation tracking: Google Scholar, -Scopus and Web of Science. Biomedical Digital Libraries, 3, -2006. doi: 10.1186/1742-5581-3-7. -[31] John J. Meier and Thomas W. Conkling. Google Scholar’s -Coverage of the Engineering Literature: An Empirical Study. The -Journal of Academic Librarianship, 34 (34): 196–201, 2008. -[32] William H. Walters. Google Scholar coverage of a -multidisciplinary field. Information Processing & Management, -43 (4): 1121–1132, July 2007. doi: -10.1016/j.ipm.2006.08.006. -[33] Google. About Google Scholar. Website, 2008. URL http://- -scholar.google.com/intl/en/scholar/about.html. -[34] Bert van Heerde. RE: Pre-print: Academic Search Engine -Optimization. Email, 3 September 2009. -[35] Google Scholar. Support for Scholarly Publishers. Website, -2009. URL http://scholar.google.com/intl/en/scholar/- -publishers.html. -[36] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 -extension to multiple weighted fields. In Proceedings of the -thirteenth ACM international conference on Information and -knowledge management, pages 42–49. ACM New York, NY, -USA, 2004. \ No newline at end of file diff --git a/bin/2010-ASEO--preprint.out b/bin/2010-ASEO--preprint.out deleted file mode 100644 index 3a93038..0000000 --- a/bin/2010-ASEO--preprint.out +++ /dev/null @@ -1,591 +0,0 @@ - - - - -Preprint of: Jöran Beel, Bela Gipp, and Erik Wilde. Academic Search Engine Optimization (ASEO): Optimizing Scholarly Literature for Google Scholar and Co. Journal of Scholarly Publishing, 41 (2): 176–190, January 2010. doi: 10.3138/jsp.41.2.176. University of Toronto Press. Downloaded from -http://www.sciplore.org -Academic Search Engine Optimization (ASEO): Optimizing Scholarly Literature for Google Scholar &amp; Co -Jöran Beel -Otto-von-Guericke University FIN / ITI / VLBA-Lab -
Germany
-beel@sciplore.org -Bela Gipp -Otto-von-Guericke University FIN / ITI / VLBA-Lab -
Germany
-gipp@sciplore.org -Erik Wilde -UC Berkeley School of Information United States -dret@berkeley.edu -This article introduces and discusses the concept of academic search engine optimization (ASEO). Based on three recently conducted studies, guidelines are provided on how to optimize scholarly literature for academic search engines in general and for Google Scholar in particular. In addition, we briefly discuss the risk of researchers’ illegitimately ‘over-optimizing’ their articles -Keywords academic search engines, academic search engine optimization, ASEO, Google Scholar, ranking algorithm, search engine -optimization, SEO -
-
- - - - -Jöran Beel -Bela Gipp - -Google Scholar’s Ranking Algorithm: The Impact of Citation Counts (An Empirical Study) -2009 -In André Flory and Martine Collard, editors, Proceedings of the 3rd IEEE International Conference on Research Challenges in ]nfoO atYon ScY6nc6F�&amp;]���1 -439--446 -Fez (Morocco) -Available on http://www.sciplore.org - -er how publishers or providers of academic repositories can optimize their Web sites and repositories for academic search engines. The guidelines are based on three studies we have recently conducted [1, 2, 3] and on our experience in developing the academic search engine Sci Plore.org. 1 In this article we do not distinguish between ‘academic databases’ and ‘academic search engines’; the latter term is us - had 552, articles at position 3 had 426, and articles at position 1000 had fifty-three. The study was based on 1,032,766 results produced by 1050 search queries in November 2008. For more detail see [1]. Figure 2: Example of a Bitmap Graphic 2.3.3 Author and Publication Name If the search query includes an author or publication name, a document in which either appears is likely to be ranked high. Fo - -[1] -Jöran Beel and Bela Gipp. Google Scholar’s Ranking Algorithm: The Impact of Citation Counts (An Empirical Study). In André Flory and Martine Collard, editors, Proceedings of the 3rd IEEE International Conference on Research Challenges in ]nfoO atYon ScY6nc6F�&amp;]���1 �, pages 439–446, Fez (Morocco), April 2009. IEEE. doi: 10.1109/RCIS.2009.5089308. ISBN 978-1-4244-2865-6. Available on http://www.sciplore.org. 
- - - -Jöran Beel -Bela Gipp - -Google Scholar’s Ranking Algorithm: An Introductory Overview -2009 -Proceedings of the 12th International &amp;onf6O6nc6 oFH�Y6�Q 6�OYHQnG ]�HO 6�OYV �]��]��1 -1 -230--241 -In Birger Larsen and Jacqueline Leta, editors -Available on http://www.sciplore.org - -er how publishers or providers of academic repositories can optimize their Web sites and repositories for academic search engines. The guidelines are based on three studies we have recently conducted [1, 2, 3] and on our experience in developing the academic search engine Sci Plore.org. 1 In this article we do not distinguish between ‘academic databases’ and ‘academic search engines’; the latter term is us - d es about vari ous (medical) topics from the journal Arteriosclerosis, Thrombosis, and V ascul ar Bi of ogy, many of whi ch di d not i nd ude the search term either in the title or in the full text [2]. Figure 3: Mean Citation Count per Position8 2.3.4 Other factors Google Scholar’s standard search does not consider publication dates. However, Google Scholar offers a special search function for ‘re -too. 4. DISCUSSION As was true in the beginning for classic SEO, there are some reservations about ASEO in the academic community. When we submitted our study about Google Scholar’s ranking algorithm [2] to a conference, it was rejected. One reviewer provided the following feedback: I’m not a big fan of this area of research [...]. I know it’s in the call for papers, but I think that’s a mistake. A s - -[2] -Jöran Beel and Bela Gipp. Google Scholar’s Ranking Algorithm: An Introductory Overview. In Birger Larsen and Jacqueline Leta, editors, Proceedings of the 12th International &amp;onf6O6nc6 oFH�Y6�Q 6�OYHQnG ]�HO 6�OYV �]��]��1 �, volume 1, pages 230–241, Rio de Janeiro (Brazil), July 2009. International Society for Scientometrics and Informetri cs. ISSN 2175-1935. Available on http://www.sciplore.org. 
- - - -Jöran Beel -Bela Gipp - -Google Scholar’s Ranking Algorithm: The Impact of Articles’ Age (An Empirical Study) -2009 -In Shahram Latifi, editor, Proceedings of the 6th International Conference on Information Technology: New Generations �]�71*��1 -160--164 -Las Vegas (USA) - -er how publishers or providers of academic repositories can optimize their Web sites and repositories for academic search engines. The guidelines are based on three studies we have recently conducted [1, 2, 3] and on our experience in developing the academic search engine Sci Plore.org. 1 In this article we do not distinguish between ‘academic databases’ and ‘academic search engines’; the latter term is us - -[3] -Jöran Beel and Bela Gipp. Google Scholar’s Ranking Algorithm: The Impact of Articles’ Age (An Empirical Study). In Shahram Latifi, editor, Proceedings of the 6th International Conference on Information Technology: New Generations �]�71*��1 �, pages 160–164, Las Vegas (USA), April 2009. IEEE. doi: 10.1109/ITNG.2009.317. ISBN 978-1424437702. Available on http://www.sciplore.org. - - - -Google - -Google’s Search Engine Optimization Starter Guide -2008 -PDF -URL http://www.google.com/-webmasters/docs/search-engine-optimization-starter-guide. pdf - -h engine optimization (SEO) for Web sites is a common procedure. SEO involves creating or modifying a Web site in a way that makes it ‘easier for search engines to both crawl and index [its] content’ [4]. There exists a huge community that discusses the latest trends in SEO and provides advice for Webmasters in forums, blogs, and newsgroups.2 Even research articles and books exist on the subject of S -earch engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines themselves publish guidelines on how to optimize Web sites for search engines [4, 27]. 
But similar information on optimizing scholarly literature for academic search engines does not exist, to our knowledge.3 2.1 Introduction to Academic Search Engine Optimization (ASEO) Based on the - -[4] -Google. Google’s Search Engine Optimization Starter Guide. PDF, November 2008. URL http://www.google.com/-webmasters/docs/search-engine-optimization-starter-guide. pdf. - - - -Albert Bifet -Carlos Castillo - -An Analysis of Factors Used in Search Engine Ranking -2005 -In Proceedings of the 14th International World Wide Web Conference (WWW2005), First International Workshop on Adversarial Information Retrieval on t -56 -http://airweb.cse.lehigh.edu/2005/bifet.pdf - -here exists a huge community that discusses the latest trends in SEO and provides advice for Webmasters in forums, blogs, and newsgroups.2 Even research articles and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, - -[5] -Albert Bifet and Carlos Castillo. An Analysis of Factors Used in Search Engine Ranking. In Proceedings of the 14th International World Wide Web Conference (WWW2005), First International Workshop on Adversarial Information Retrieval on t 56 11 6b ��]�11 6�����, 2005. http://airweb.cse.lehigh.edu/2005/bifet.pdf. - - - -Michael P Evans - -Analysing Google rankings through search engine optimization data -2007 -Internet Research -17 -21--37 - -here exists a huge community that discusses the latest trends in SEO and provides advice for Webmasters in forums, blogs, and newsgroups.2 Even research articles and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, - -[6] -Michael P. Evans. 
Analysing Google rankings through search engine optimization data. Internet Research, 17 (1): 21–37, 2007. doi: 10.1108/10662240710730470. - - - -Jin Zhang -Alexandra Dimitroff - -The impact of metadata implementation on webpage visibility in search engine results (Part II) -2005 -Cross-Language Information Retrieval -41 -691--715 - -here exists a huge community that discusses the latest trends in SEO and provides advice for Webmasters in forums, blogs, and newsgroups.2 Even research articles and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, - -[7] -Jin Zhang and Alexandra Dimitroff. The impact of metadata implementation on webpage visibility in search engine results (Part II). Cross-Language Information Retrieval, 41 (3): 691– 715, May 2005. - - - -Harold Davis - -Search Engine Optimization -2006 -O’Reilly - -here exists a huge community that discusses the latest trends in SEO and provides advice for Webmasters in forums, blogs, and newsgroups.2 Even research articles and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, - -[8] -Harold Davis. Search Engine Optimization. O’Reilly, 2006. - - - -Jennifer Grappone -Gradiva Couzin - -Search Engine Optimization: An Hour a Day -2008 -John Wiley -and Sons, 2nd edition - -here exists a huge community that discusses the latest trends in SEO and provides advice for Webmasters in forums, blogs, and newsgroups.2 Even research articles and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. 
When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, - -[9] -Jennifer Grappone and Gradiva Couzin. Search Engine Optimization: An Hour a Day. John Wiley and Sons, 2nd edition, 2008. - - - -Peter Kent - -Search engine optimization for dummies -2006 -Willey Publishing Inc - -here exists a huge community that discusses the latest trends in SEO and provides advice for Webmasters in forums, blogs, and newsgroups.2 Even research articles and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, - -[10] -Peter Kent. Search engine optimization for dummies. Willey Publishing Inc, 2006. - - - -AA Benczur -K Csalogány -T Sarlós -M Uher - -SpamRank – Fully Automatic Link Spam Detection -2005 -In AGv6OsaOYal ]nDO atYoRR6tOY6vaRQ❑ 56 11 6b ��Y�11 ���Q - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[11] -AA Benczur, K Csalogány, T Sarlós, and M Uher. SpamRank – Fully Automatic Link Spam Detection. In AGv6OsaOYal ]nDO atYoRR6tOY6vaRQ❑ 56 11 6b ��Y�11 ���Q�, 2005. - - - -A Benczúr -K Csalogány -T Sarlós - -Link-based similarity search to fight web spam -2006 -Adversarial Information Retrieval on the Web (AIR WEB) -Seattle, Washington, USA - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. 
When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[12] -A. Benczúr, K. Csalogány, and T. Sarlós. Link-based similarity search to fight web spam. Adversarial Information Retrieval on the Web (AIR WEB), Seattle, Washington, USA, 2006. - - - -I Drost -T Scheffer - -Thwarting the nigritude ultramarine: Learning to identify link spam -2005 -Lecture Notes in Computer Science -3720 - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[13] -I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: Learning to identify link spam. Lecture Notes in Computer Science, 3720: 96, 2005. - - - -D Fetterly -M Manasse -M Najork - -Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages -2004 -1--6 - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[14] -D. Fetterly, M. 
Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. pages 1–6, 2004. - - - -Q Gan -T Suel - -Improving web spam classifiers using link structure -2007 -In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web -20 -ACM -[ 15] -Q. Gan and T. Suel . Improving web spam classifiers using link structure. In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, page 20. ACM, 2007. - - - -Z Gyöngyi -H Garcia-Molina - -Link spam alliances -2005 -In Proceedings of the 31st international conference on Very large data bases -528 -VLDB Endowment -[ 16] -Z. Gyöngyi and H. Garcia-Molina. Link spam alliances. In Proceedings of the 31st international conference on Very large data bases, page 528. VLDB Endowment, 2005. - - - -H Saito -M Toyoda -M Kitsuregawa -K Aihara - -A large-scale study of link spam detection by graph algorithms -2007 -In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web -48 -ACM - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[17] -H. Saito, M. Toyoda, M. Kitsuregawa, and K. Aihara. A large-scale study of link spam detection by graph algorithms. In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, page 48. ACM, 2007. - - - -B Wu -K - -Chel lapilla. 
Extracting link spam using biased random walks from spam seed sets -2007 -In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web -44 -ACM -[ 18] -B. Wu and K. Chel lapilla. Extracting link spam using biased random walks from spam seed sets. In Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, page 44. ACM, 2007. - - - -C Castillo -D Donato -A Gionis -V Murdock -F Silvestri - -Know your neighbors: Web spam detection using the web topology -2007 -In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval -430 -ACM - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[19] -C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, page 430. ACM, 2007. - - - -G G Geng -C H Wang -Q D Li - -Improving Spamdexing Detection Via a Two-Stage Classification Strategy -2008 -356 - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. 
Probably the strongest argument for SEO is the fact that search engines - -[20] -G.G. Geng, C.H. Wang, and Q.D. Li. Improving Spamdexing Detection Via a Two-Stage Classification Strategy. page 356, 2008. - - - -I S Nathenson - -Internet infoglut and invisible ink: Spamdexing search engines with meta tags -1998 -Harv. J. Law &amp; Tec -12 -43--683 - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[21] -I.S. Nathenson. Internet infoglut and invisible ink: Spamdexing search engines with meta tags. Harv. J. Law &amp; Tec, 12: 43–683, 1998. - - - -T Urvoy -E Chauveau -P Filoche -T Lavergne - -Tracking web spam with HTM L style similarities -2008 -ACM Transactions on the Web (TWEB) -2 - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[22] -T. Urvoy, E. Chauveau, P. Filoche, and T. Lavergne. Tracking web spam with HTM L style similarities. ACM Transactions on the Web (TWEB), 2, 2008. - - - -T Urvoy -T Lavergne -P Filoche - -Tracking web spam with hidden style similarity -2006 -In AIRWeb 2006 -25 - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. 
When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[23] -T. Urvoy, T. Lavergne, and P. Filoche. Tracking web spam with hidden style similarity. In AIRWeb 2006, page 25, 2006. - - - -Masahiro Kimura -Kazumi Saito -Kazuhiro Kazama -Shin ya Sato - -Detecting Search Engine Spam from a Trackback Network in Blogspace -2005 -Lecture Notes in Computer Science: Knowledge-Based Intelligent Information and Engineering Systems -3684 -723--729 - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[24] -Masahiro Kimura, Kazumi Saito, Kazuhiro Kazama, and Shin ya Sato. Detecting Search Engine Spam from a Trackback Network in Blogspace. Lecture Notes in Computer Science: Knowledge-Based Intelligent Information and Engineering Systems, 3684: 723–729, 2005. doi: 10.1007/11554028_101. - - - -Alexandros Ntoulas -Marc Najork -Mark Manasse -Dennis Fetterly - -Detecting spam web pages through content analysis -2006 -In 15th International Conference on World Wide Web -83--92 -ACM - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. 
When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[25] -Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. Detecting spam web pages through content analysis. In 15th International Conference on World Wide Web, pages 83–92. ACM, 2006. - - - -Baoning Wu -Brian D Davison - -Identifying link farm spam pages -2005 -In 14th International Conference on World Wide Web -820--829 - -es and books exist on the subject of SEO [5, 6, 7, 8, 9, 10]. When SEO began, many expressed their concerns that it would promote spam and tweaking, and, indeed, search-engine spam is a serious issue [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]. Today, however, SEO is a common and widely accepted procedure and overall, search engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines - -[26] -Baoning Wu and Brian D. Davison. Identifying link farm spam pages. In 14th International Conference on World Wide Web, pages 820–829, 2005. - - - -Yahoo - -How do I improve the ranking of my web site in the search results -2007 -URL http://help.yahoo.com/l/us/-yahoo/search/indexing/ranking-02.html - -earch engines manage to identify spam quite well. Probably the strongest argument for SEO is the fact that search engines themselves publish guidelines on how to optimize Web sites for search engines [4, 27]. But similar information on optimizing scholarly literature for academic search engines does not exist, to our knowledge.3 2.1 Introduction to Academic Search Engine Optimization (ASEO) Based on the - -[27] -Yahoo! 
How do I improve the ranking of my web site in the search results?, July 2007. URL http://help.yahoo.com/l/us/-yahoo/search/indexing/ranking-02.html. - - - -Alex Chitu - -Google’s Market Share in Your Country -2009 -Website -URL http://googlesystem.blogspot.com/-2009/03/googles-market-share-in-your-country.html https://-spreadsheets.google.com/-ccc?key=pLaE9tsVLp_0y1 FKWBCKGBA - -it easier for academic search engines to both crawl it and index it. ASEO differs from SEO in four significant respects. First, for Web search, Google is the market leader in most (Western) countries [28]. This means that for Webmasters (focusing on Western Internet users), it is generally sufficient to optimize their Web sites for Google. In contrast, no such market leader exists 2 E.g. http://www.ab - -[28] -Alex Chitu. Google’s Market Share in Your Country. Website, March 2009. URL http://googlesystem.blogspot.com/-2009/03/googles-market-share-in-your-country.html https://-spreadsheets.google.com/-ccc?key=pLaE9tsVLp_0y1 FKWBCKGBA. - - - -D Lewandowski -P Mayr - -Exploring the academic invisible web -2006 -Library Hi Tech -24 -529--539 - -s available on the Web and accessible to Web-based academic search engines such as CiteSeer. Most academic articles are stored in publishers’ databases; they are part of the ‘academic invisible web,’ [29] and (academic) search engines usually cannot access and index these articles. A few academic search engines, such as Scirus and Google Scholar, cooperate with publishers, but still they do not cover - -[29] -D. Lewandowski and P. Mayr. Exploring the academic invisible web. Library Hi Tech, 24 (4) : 529–539, 2006. - - - -Nisa Bakkalbasi -Kathleen Bauer -Janis Glover -Lei Wang - -Three options for citation tracking -2006 -Google Scholar, Scopus and Web of Science. Biomedical Digital Libraries -3 -10--1186 - - engines usually cannot access and index these articles. 
A few academic search engines, such as Scirus and Google Scholar, cooperate with publishers, but still they do not cover all existing articles [30, 31, 32]. Researchers therefore need to think seriously about how to get their articles indexed by academic search engines. Third, Webmasters can alter their pages by adding or replacing words and links, dele - -[30] -Nisa Bakkalbasi, Kathleen Bauer, Janis Glover, and Lei Wang. Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical Digital Libraries, 3, 2006. doi : 10.1186/1742-5581-3-7. - - - -John J Meier -Thomas W Conkling - -Google Scholar’s Coverage of the Engineering Literature: An Empirical Study -2008 -The Journal of Academic Librarianship -34 -196--201 - - engines usually cannot access and index these articles. A few academic search engines, such as Scirus and Google Scholar, cooperate with publishers, but still they do not cover all existing articles [30, 31, 32]. Researchers therefore need to think seriously about how to get their articles indexed by academic search engines. Third, Webmasters can alter their pages by adding or replacing words and links, dele - -[31] -John J. Meier and Thomas W. Conkling. Google Scholar’s Coverage of the Engineering Literature: An Empirical Study. The Journal of Academic Librarianship, 34 (34): 196–201, 2008. - - - -William H Walters - -Google Scholar coverage of a multidisciplinary field. Information Processing &amp -2007 -Management -doi : doi :10.1016/j . ipm.2006.08.006 -43 -1121--1132 - - engines usually cannot access and index these articles. A few academic search engines, such as Scirus and Google Scholar, cooperate with publishers, but still they do not cover all existing articles [30, 31, 32]. Researchers therefore need to think seriously about how to get their articles indexed by academic search engines. Third, Webmasters can alter their pages by adding or replacing words and links, dele - -[32] -William H. Walters. 
Google Scholar coverage of a multidisciplinary field. Information Processing &amp; Management, 43 (4) : 1121–1132, July 2007. doi : doi :10.1016/j . ipm.2006.08.006. - - - -Google - -About Google Scholar -2008 -Website -URL http://-scholar.google.com/intl/en/scholar/about.html - -ial search function for ‘recent articles,’ which limits results to articles published within the past five years. Furthermore, Google Scholar claims to consider both publication and author reputation [33]. However, we could not research the influence of these factors because of a lack of data, and therefore we do not consider them here. 2.3. 5 Sources Indexed by Google Scholar Bert van Heerde, a profe - -[33] -Google. About Google Scholar. Website, 2008. URL http://-scholar.google.com/intl/en/scholar/about.html. - - - -RE - -Pre-print: Academic Search Engine Optimization -2009 -Email -3 - -e term ‘invitation based search engine’ to describe Google Scholar: Only articles from trusted sources and articles that are ‘invited’ (cited) by articles already indexed are included in the database [34]. ‘Trusted sources,’ in this case, are publishers that cooperate directly with Google Scholar, as well as publishers and Webmasters who have requested that Google Scholar crawl their databases and Web - -[34] -Bert van Heerde. RE: Pre-print: Academic Search Engine Optimization. Email, 3 September 2009. - - - -Google Scholar - -Support for Scholarly -2009 -Publishers. Website -URL http://scholar.google.com/intl/en/scholar/-publishers.html - - http://www.seo.com/blog http://www.abakus-internet-marketing.de/seoblog 3 Google Scholar offers some information for publishers on how to get their articles indexed by Google Scholar and ranked well [35]. However, this information is superficial in comparison to other SEO articles, and the information is not aimed at authors. for searching academic articles, and researchers would need to optimize the -ee Figure 4 for an illustrative example). 
Figure 4: Linking database entries with external PDFs If different PDF files of an article exist, Google Scholar groups them to improve the article’s ranking [35]. For instance, if a preprint version of an article is available on the author’s Web page and the final version is available on the publisher’s site, Google indexes both as one version. If the two ver - -[35] -Google Scholar. Support for Scholarly Publishers. Website, 2009. URL http://scholar.google.com/intl/en/scholar/-publishers.html. - - - -S Robertson -H Zaragoza -M Taylor - -Simple BM25 extension to multiple weighted fields -2004 -In Proceedings of the thirteenth ACM international conference on Information and knowledge management -42--49 -ACM -New York, NY, USA - -is in which the term occurs, the more relevant the document is considered4. This means that an occurrence in the 4 Some algorithms, such as the BM25(f ), saturate when a word occurs often in the text [36]. title is weighted more heavily than an occurrence in the abstract, which carries more weight than an occurrence in a (sub)heading, than in the body text, and so on. Possible document fields that may - -[36] -S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 42–49. ACM New York, NY, USA, 2004. - - - -
\ No newline at end of file diff --git a/bin/2010-ASEO--preprint.pdf.xml b/bin/2010-ASEO--preprint.pdf.xml deleted file mode 100644 index 366c807..0000000 --- a/bin/2010-ASEO--preprint.pdf.xml +++ /dev/null @@ -1,12737 +0,0 @@ - - - - - - - -en - - - - - - - - - - -
- - - - - - - - - -Preprint - -of: - - - -Jöran - -Beel, - -Bela - -Gipp, - -and - -Erik - -Wilde. - -Academic - -Search - -Engine - -Optimization - -(ASEO): - -Optimizing - -Scholarly - -Literature - -for - -Google - -Scholar - -and - - - - -Co. - -Journal - -of - -Scholarly - -Publishing, - -41 - -(2): - -176–190, - -January - -2010. - -doi: - -10.3138/jsp.41.2.176. - -University - -of - -Toronto - -Press. - -Downloaded - -from - - - -http://www.sciplore.org - - - - - - - - -Academic - -Search - -Engine - -Optimization - -(ASEO): - -Optimizing - - - - -Scholarly - -Literature - -for - -Google - -Scholar - -& - -Co. - - - -
-
- - - -Jöran - -Beel - - - - -Otto-von-Guericke - -University - - -FIN - -/ - -ITI - -/ - -VLBA-Lab - - -Germany - - - - -beel@sciplore.org - - - - - - -Bela - -Gipp - - - - -Otto-von-Guericke - -University - - -FIN - -/ - -ITI - -/ - -VLBA-Lab - - -Germany - - - - -gipp@sciplore.org - - - - - - -Erik - -Wilde - - - - -UC - -Berkeley - - -School - -of - -Information - - -United - -States - - - - -dret@berkeley.edu - - - -
-
- - - -ABSTRACT - - - - -This - -article - -introduces - -and - -discusses - -the - -concept - -of - -academic - - - -search - -engine - -optimization - -(ASEO). - -Based - -on - -three - -recently - - - -conducted - -studies, - -guidelines - -are - -provided - -on - -how - -to - -optimize - - - -scholarly - -literature - -for - -academic - -search - -engines - -in - -general - -and - - - -for - -Google - -Scholar - -in - -particular. - -In - -addition, - -we - -briefly - -discuss - - - - -the - -risk - -of - -researchers’ - -illegitimately - - - -‘over - -- - -optimizing’ - - - -their - - - - -articles. - - - - -Keywords - - - - - - - - -academic - -search - -engines, - -academic - -search - -engine - -optimization, - - - - - - -ASEO, - -Google - -Scholar, - -ranking - -algorithm, - -search - -engine - - - - -optimization, - -SEO - - - - - - - - -1. - -INTRODUCTION - - - - -Researchers - -should - -have - -an - -interest - -in - -ensuring - -that - -their - -articles - - - - -are - -indexed - -by - -academic - -search - - - -engines - -1 - - - -such - -as - -Google - -Scholar, - - - - -IEEE - -Xplore, - -PubMed, - -and - -SciPlore.org, - -which - -greatly - -improves - - - -their - -ability - -to - -make - -their - -articles - -available - -to - -the - -academic - - - -community. - -Not - -only - -should - -authors - -take - -an - -interest - -in - -seeing - - - - -that - - - -their - -articles - -are - -indexed, - -they - -also - -should - -be - -interesting - -in - - - - -where - - -the - -articles - -are - -displayed - -in - -the - -results - -list. - -Like - -any - -other - - - - -type - -of - -ranked - -search - -results, - -articles - -displayed - -in - -top - -positions - - - -are - -more - -likely - -to - -be - -read. - - - - - -This - -article - -presents - -the - -concept - -of - - - -academic - -search - -engine - - - - -optimization - - -(ASEO) - -to - -optimize - -scholarly - -literature - -for - - - - -academic - -search - -engines. 
 - -The - -first - -part - -of - -the - -article - -covers - -related - -work - -that - -has - -been - -done - -mostly - -in - -the - -field - -of - -general - -search - -engine - -optimization - -for - -Web - -pages. - -The - -second - -part - -defines - -ASEO - -and - -compares - -it - -to - -search - -engine - -optimization - -for - -Web - -pages. - -The - -third - -part - -provides - -an - -overview - -of - -ranking - -algorithms - -of - -academic - -search - -engines - -in - -general, - -followed - -by - -an - -overview - -of - -Google - -Scholar’s - -ranking - -algorithm. - -Finally, - -guidelines - -are - -provided - -on - -how - -authors - -can - -optimize - -their - -articles - -for - -academic - -search - -engines. - -This - -article - -does - -not - -cover - -how - -publishers - -or - -providers - -of - -academic - -repositories - -can - -optimize - -their - -Web - -sites - -and - -repositories - -for - -academic - -search - -engines. - -The - -guidelines - -are - -based - -on - -three - -studies - -we - -have - -recently - -conducted - -[1-3] - -and - -on - -our - -experience - -in - -developing - -the - -academic - -search - -engine - -SciPlore.org. - - - - -1 - - -In - -this - -article - -we - -do - -not - -distinguish - -between - -‘academic - -databases’ - -and - -‘academic - -search - -engines’; - -the - -latter - -term - -is - -used - -as - -a - -synonym - -for - -both. - - - - -2. - -RELATED - -WORK - - - - -On - -the - -Web, - -search - -engine - -optimization - -(SEO) - -for - -Web - -sites - -is - -a - -common - -procedure. - -SEO - -involves - -creating - -or - -modifying - -a - -Web - -site - -in - -a - -way - -that - -makes - -it - -‘easier - -for - -search - -engines - -to - -both - -crawl - -and - -index - -[its] - -content’ - -[4].
- -There - -exists - -a - -huge - -community - - - - -that - -discusses - -the - -latest - -trends - -in - -SEO - -and - -provides - -advice - -for - - - - -Webmasters - -in - -forums, - -blogs, - -and - - - -newsgroups. - -2 - - - -Even - -research - - - - -articles - -and - -books - -exist - -on - -the - -subject - -of - -SEO - -[5-10]. - -When - -SEO - - - -began, - -many - -expressed - -their - -concerns - -that - -it - -would - -promote - - - -spam - -and - -tweaking, - -and, - -indeed, - -search-engine - -spam - -is - -a - -serious - - - -issue - -[11-26]. - -Today, - -however, - -SEO - -is - -a - -common - -and - -widely - - - -accepted - -procedure - -and - -overall, - -search - -engines - -manage - -to - - - -identify - -spam - -quite - -well. - -Probably - -the - -strongest - -argument - -for - - - -SEO - -is - -the - -fact - -that - -search - -engines - -themselves - -publish - -guidelines - - - -on - -how - -to - -optimize - -Web - -sites - -for - -search - -engines - -[4, - -27]. - -But - - - -similar - -information - -on - -optimizing - -scholarly - -literature - -for - - - - -academic - -search - -engines - -does - -not - -exist, - -to - -our - - - -knowledge. - -3 - - - - - - - - -2.1 - -Introduction - -to - -Academic - -Search - -Engine - - - -Optimization - -(ASEO) - - - - - -Based - -on - -the - -definition - -of - - - -search - -engine - -optimization - - - -for - -Web - - - - - -pages - -(SEO), - -we - -define - - - -academic - -search - -engine - -optimization - - - - - - -(ASEO) - -as - -follows: - - - - -Academic - -search - -engine - -optimization - -(ASEO) - -is - -the - -creation, - - - -publication, - -and - -modification - -of - -scholarly - -literature - -in - -a - - - -way - -that - -makes - -it - -easier - -for - -academic - -search - -engines - -to - -both - - - -crawl - -it - -and - -index - -it. - - - - -ASEO - -differs - -from - -SEO - -in - -four - -significant - -respects. 
- -First, - -for - - - -Web - -search, - -Google - -is - -the - -market - -leader - -in - -most - -(Western) - - - -countries - -[28]. - -This - -means - -that - -for - -Webmasters - -(focusing - -on - - - -Western - -Internet - -users), - -it - -is - -generally - -sufficient - -to - -optimize - -their - - - -Web - -sites - -for - -Google. - -In - -contrast, - -no - -such - -market - -leader - -exists - - - - - -2 - - -E.g. - -http://www.abakus-internet-marketing.de/foren - - - - -http://www.highrankings.com/forum - - - -http://www.seo-guy.com/forum - - - - -http://www.seomoz.org/blog - - - - -http://www.seo.com/blog - - - -http://www.abakus-internet-marketing.de/seoblog - - - - -3 - - -Google - -Scholar - -offers - -some - -information - -for - -publishers - -on - -how - - - - -to - -get - -their - -articles - -indexed - -by - -Google - -Scholar - -and - -ranked - -well - - - -[35]. - -However, - -this - -information - -is - -superficial - -in - -comparison - -to - - - -other - -SEO - -articles, - -and - -the - -information - -is - -not - -aimed - -at - -authors. - - - -
- -
-
- - - - - - - -en - - - - - - - - -
- - - -for - -searching - -academic - -articles, - -and - -researchers - -would - -need - -to - - - -optimize - -their - -articles - -for - -several - -academic - -search - -engines. - -If - - - -these - -search - -engines - -are - -based - -on - -different - -crawling - -and - -ranking - - - -methods, - -optimization - -can - -become - -complicated. - - - - -Second, - -Webmasters - -usually - -do - -not - -need - -to - -worry - -about - -whether - - - -their - -site - -is - -indexed - -by - -a - -search - -engine: - -as - -long - -as - -any - -Web - -page - - - -is - -linked - -to - -an - -already - -indexed - -page, - -it - -will - -be - -crawled - -and - - - -indexed - -by - -Web - -search - -engines - -at - -some - -point. - -The - -situation - -is - - - -different - -in - -academia, - -where - -only - -a - -fraction - -of - -all - -published - - - -material - -is - -available - -on - -the - -Web - -and - -accessible - -to - -Web-based - - - -academic - -search - -engines - -such - -as - -CiteSeer. - -Most - -academic - - - - -articles - -are - -stored - -in - -publishers’ - - - -databases; - -they - -are - -part - -of - -the - - - - - -‘academic - -invisible - -web,’ - - - -[29] - -and - -(academic) - -search - -engines - - - - -usually - -cannot - -access - -and - -index - -these - -articles. - -A - -few - -academic - - - -search - -engines, - -such - -as - -Scirus - -and - -Google - -Scholar, - -cooperate - - - -with - -publishers, - -but - -still - -they - -do - -not - -cover - -all - -existing - -articles - - - -[30-32]. - -Researchers - -therefore - -need - -to - -think - -seriously - -about - -how - - - -to - -get - -their - -articles - -indexed - -by - -academic - -search - -engines. 
 - - - -Third, - -Webmasters - -can - -alter - -their - -pages - -by - -adding - -or - -replacing - -words - -and - -links, - -deleting - -pages, - -offering - -multiple - -versions - -with - -slight - -variations, - -and - -so - -on; - -in - -this - -way - -they - -can - -test - -new - -methods - -and - -adapt - -to - -changes - -in - -ranking - -algorithms. - -Scholarly - -authors - -can - -hardly - -do - -so: - -once - -an - -article - -is - -published, - -it - -is - -difficult - -and - -sometimes - -impossible - -to - -alter - -it. - -Therefore, - -ASEO - -needs - -to - -be - -performed - -particularly - -carefully. - - - - -Finally, - -Web - -search - -engines - -usually - -index - -all - -text - -on - -a - -Web - -site, - -or - -at - -least - -the - -majority - -of - -it. - -In - -contrast, - -some - -academic - -search - -engines - -do - -not - -index - -a - -document’s - -full - -text - -but - -instead - -index - -only - -the - -title - -and - -abstract. - -This - -means - -that - -for - -some - -academic - -search - -engines - -authors - -need - -to - -focus - -on - -the - -article’s - -title - -and - -abstract, - -but - -in - -other - -cases - -they - -still - -have - -to - -consider - -the - -full - -text - -for - -other - -search - -engines. - - - - -2.2 - -An - -Overview - -of - -Academic - -Search - -Engine - -Ranking - -Algorithms - - - - -The - -basic - -concept - -of - -keyword-based - -searching - -is - -the - -same - -for - -all - -major - -(academic) - -search - -engines. - -Users - -search - -for - -a - -search - -term - -in - -a - -certain - -document - -field - -(e.g., - -title, - -abstract, - -body - -text), - -or - -in - -all - -fields, - -and - -all - -documents - -containing - -the - -search - -term - -are - -listed - -on - -the - -results - -page.
- -Academic - -search - -engines - -use - -different - - - -ranking - -algorithms - -to - -determine - -in - -which - -position - -the - -results - -are - - - -displayed. - -Some - -let - -the - -user - -choose - -one - -factor - -on - -which - -to - -rank - - - -the - -results - -(common - -ranking - -factors - -are - -publication - -date, - -citation - - - -count, - -author - -or - -journal - -name - -and - -reputation, - -and - -relevance - -of - - - -the - -document); - -others - -combine - -the - -ranking - -factors - -into - -one - - - -algorithm, - -and, - -more - -often - -than - -not, - -the - -user - -has - -no - -influence - -on - - - - -the - -factor’s - -weighting. - - - - - - - - - -The - - -relevance - - -of - -a - -document - -is - -basically - -a - -function - -of - -how - -often - - - - -the - -search - -term - -occurs - -in - -that - -document - -and - -in - -which - -part - -of - -the - - - -document - -it - -occurs. - -Generally - -speaking, - -the - -more - -often - -a - -search - - - -term - -occurs - -in - -the - -document, - -and - -the - -more - -important - -the - - - -document - -field - -is - -in - -which - -the - -term - -occurs, - -the - -more - -relevant - -the - - - - -document - -is - - - -considered - -4 - -. - - - -This - -means - -that - -an - -occurrence - -in - -the - - - - - - - - - - - -4 - - -Some - -algorithms, - -such - -as - -the - -BM25(f - -), - -saturate - -when - -a - -word - - - - -occurs - -often - -in - -the - -text - -[36]. - - - - - - -title - -is - -weighted - -more - -heavily - -than - -an - -occurrence - -in - -the - -abstract, - - - -which - -carries - -more - -weight - -than - -an - -occurrence - -in - -a - -(sub)heading, - - - -than - -in - -the - -body - -text, - -and - -so - -on. 
- -Possible - -document - -fields - -that - - - - -may - -be - -weighted - -differently - -by - -academic - -search - -engines - - - -are: - -5 - - - - - - - - - - -• - -Title - - - - - - -• - -Author - -names - - - - - - -• - -Abstract - - - - - - -• - -(Sub)headings - - - - - - -• - -Author - -keywords - - - - - - -• - -Body - -text - - - - - - -• - -Tables - -and - -figures - - - - - - -• - -Publication - -name - -(name - -of - -journal, - -conference, - - - -proceedings, - -book, - -etc.) - - - - - - -• - -User - -keywords - -(Social - -tags) - - - - - - -• - -Social - -annotations - - - - - - -• - -Description - - - - - - -• - -Filename - - - - - - -• - -URI - - - - -The - -metadata - -of - -electronic - -files - -are - -especially - -important - -for - - - -academic - -search - -engines - -crawling - -the - -Web. - -When - -a - -search - - - -engine - -finds - -a - -PDF - -on - -the - -Web, - -it - -does - -not - -know - -whether - -this - - - -PDF - -represents - -an - -academic - -article, - -or - -which - -one - -it - -belongs - -to; - - - -therefore, - -the - -PDF - -must - -be - -identified, - -and - -one - -way - -to - -do - -this - -is - - - -by - -extracting - -the - -author - -and - -title. - -This - -can - -be - -done - -by - -analyzing - - - -the - -full - -text - -of - -the - -document - -or - -the - -metadata - -of - -the - -PDF. - - - - -It - -is - -also - -important - -to - -note - -that - -text - -in - -figures - -and - -tables - -usually - - - -is - -indexed - -only - -if - -it - -is - -embedded - -as - -real - -text - -or - -within - -a - -vector - - - -graphic. - -If - -text - -is - -embedded - -as - -a - -raster - -graphic - -(e.g., - -*.bmp, - - - -*.png, - -*.gif, - -*.tif, - -*.jpg), - -most, - -if - -not - -all, - -search - -engines - -will - -not - - - -index - -the - -text - -(see - -Figures - -1 - -and - -2 - -for - -an - -illustration - -of - - - - -differences - -between - -vector - -and - -raster/bitmap - - - -graphics). 
 - -6 - - - -To - -our - -knowledge, - -none - -of - -the - -major - -academic - -search - -engines - -currently - -considers - -synonyms. - -This - -means - -that - -a - -document - -containing - -only - -the - -term - -‘academic - -search - -engine’ - -would - -not - -be - -found - -via - -a - -search - -for - -‘scientific - -paper - -search - -engine’ - -or - -‘academic - -database.’ - -What - -most - -academic - -search - -engines - -do - -is - -stemming: - -words - -are - -reduced - -to - -their - -stems - -(e.g., - -‘analysed’ - -and - -‘analysing’ - -would - -be - -reduced - -to - -‘analyse’). - - - - -2.3 - -Google - -Scholar’s - -Ranking - -Algorithm - - - - -Google - -Scholar - -is - -one - -of - -those - -search - -engines - -that - -combine - -several - -factors - -into - -one - -ranking - -algorithm. - -The - -most - -important - -factors - -are - -relevance, - -citation - -count, - -author - -name(s), - -and - -name - -of - -publication. - -7 - - - - -5 - - -Some - -of - -the - -data - -could - -be - -retrieved - -from - -the - -document - -full - -text, - -others - -from - -the - -metadata - -(of - -electronic - -files). - - - - -6 - - -Theoretically - -search - -engines - -could - -index - -the - -text - -in - -raster/bitmap - -graphics, - -but - -they - -would - -have - -to - -apply - -optical - -character - -recognition - -(OCR). - -To - -our - -knowledge, - -no - -search - -engine - -currently - -does - -this, - -although - -some - -are - -using - -OCR - -to - -index - -complete - -scans - -of - -scholarly - -literature. - - - - -7 - - -Google - -Scholar - -offers - -different - -search - -functions.
- -For - -instance, - -it - - - - -is - -possible - -to - -search - -for - -‘related - -articles’ - -and - -‘recent - -articles.’ - - - -In - -this - -article - -we - -focus - -on - -the - -normal - -ranking - -algorithm, - -which - - - -is - -applied - -for - -the - -standard - -keyword - -search. - - - -
- -
-
- - - - - - - -en - - - - - - - - -
- - - -2.3.1 - -Relevance - - - - -Google - -Scholar - -focuses - -strongly - -on - -document - -titles. - -Documents - -containing - -the - -search - -term - -in - -the - -title - -are - -likely - -to - -be - -positioned - -near - -the - -top - -of - -the - -results - -list. - -Google - -Scholar - -also - -seems - -to - -consider - -the - -length - -of - -a - -title: - -In - -a - -search - -for - -the - -term - -‘SEO,’ - -a - -document - -titled - -‘SEO: - -An - -Overview’ - -would - -be - -ranked - -higher - -than - -one - -titled - -‘Search - -Engine - -Optimization - -(SEO): - -A - -Literature - -Survey - -of - -the - -Current - -State - -of - -the - -Art.’ - - - - -Although - -Google - -Scholar - -indexes - -entire - -documents, - -the - -total - -search - -term - -count - -in - -the - -document - -has - -little - -or - -no - -impact. - -In - -a - -search - -for - -‘recommender - -systems,’ - -a - -document - -containing - -fifty - -instances - -of - -this - -term - -would - -not - -necessarily - -be - -ranked - -higher - -than - -a - -document - -containing - -only - -ten - -instances. - - - - -Figure - -1: - -Example - -of - -a - -Vector - -Graphic - - - - -Like - -other - -search - -engines, - -Google - -Scholar - -does - -not - -index - -text - -in - -figures - -and - -tables - -inserted - -as - -raster/bitmap - -graphics, - -but - -it - -does - -index - -text - -in - -vector - -graphics. - -It - -is - -also - -known - -that - -neither - -synonyms - -nor - -PDF - -metadata - -are - -considered. - - - - -2.3.2 - -Citation - -Counts - - - - -Citation - -counts - -play - -a - -major - -role - -in - -Google - -Scholar’s - -ranking - -algorithm, - -as - -illustrated - -in - -Figure - -3, - -which - -shows - -the - -mean - -citation - -count - -for - -each - -position - -in - -Google - -Scholar.
 - -8 - - - -It - -is - -clear - -that, - -on - -average, - -articles - -in - -the - -top - -positions - -have - -significantly - -more - -citations - -than - -articles - -in - -the - -lowest - -positions. - -This - -means - -that - -to - -achieve - -a - -good - -ranking - -in - -Google - -Scholar, - -many - -citations - -are - -essential. - -Google - -Scholar - -seems - -not - -to - -differentiate - -between - -self-citations - -and - -citations - -by - -third - -parties. - - - - -8 - - -On - -average, - -articles - -at - -position - -1 - -had - -834 - -citations, - -articles - -at - -position - -2 - -had - -552, - -articles - -at - -position - -3 - -had - -426, - -and - -articles - -at - -position - -1000 - -had - -fifty-three. - -The - -study - -was - -based - -on - -1,032,766 - -results - -produced - -by - -1050 - -search - -queries - -in - -November - -2008. - -For - -more - -detail - -see - -[1]. - - - - -Figure - -2: - -Example - -of - -a - -Bitmap - -Graphic - - - - -2.3.3 - -Author - -and - -Publication - -Name - - - - -If - -the - -search - -query - -includes - -an - -author - -or - -publication - -name, - -a - -document - -in - -which - -either - -appears - -is - -likely - -to - -be - -ranked - -high. - -For - -instance, - -seventy-four - -of - -the - -top - -100 - -results - -of - -a - -search - -for - -‘arteriosclerosis - -and - -thrombosis - -cure’ - -were - -articles - -about - -various - -(medical) - -topics - -from - -the - -journal - -Arteriosclerosis, - -Thrombosis, - -and - -Vascular - -Biology, - -many - -of - -which - -did - -not - -include - -the - -search - -term - -either - -in - -the - -title - -or - -in - -the - -full - -text - -[2].
 - - - - - - -Figure - -3: - -Mean - -Citation - -Count - -per - -Position - -8 - - - - -2.3.4 - -Other - -factors - - - - -Google - -Scholar’s - -standard - -search - -does - -not - -consider - -publication - -dates. - -However, - -Google - -Scholar - -offers - -a - -special - -search - -function - -for - -‘recent - -articles,’ - -which - -limits - -results - -to - -articles - -published - -within - -the - -past - -five - -years. - -Furthermore, - -Google - -Scholar - -claims - -to - -consider - -both - -publication - -and - -author - -reputation - -[33]. - -However, - -we - -could - -not - -research - -the - -influence - -of - -these - -factors - -because - -of - -a - -lack - -of - -data, - -and - -therefore - -we - -do - -not - -consider - -them - -here. - - - -
- -
-
- - - - - - - -en - - - - - - - - -
- - - -2.3.5 - -Sources - -Indexed - -by - -Google - -Scholar - - - - -Bert - -van - -Heerde, - -a - -professional - -in - -the - -field - -of - -SEO, - -uses - -the - -term - -‘invitation - -based - -search - -engine’ - -to - -describe - -Google - -Scholar: - -Only - -articles - -from - -trusted - -sources - -and - -articles - -that - -are - -‘invited’ - -(cited) - -by - -articles - -already - -indexed - -are - -included - -in - -the - -database - -[34]. - -‘Trusted - -sources,’ - -in - -this - -case, - -are - -publishers - -that - -cooperate - -directly - -with - -Google - -Scholar, - -as - -well - -as - -publishers - -and - -Webmasters - -who - -have - -requested - -that - -Google - -Scholar - -crawl - -their - -databases - -and - -Web - -sites. - -9 - - - - -Once - -an - -article - -is - -included - -in - -Google - -Scholar’s - -database, - -Google - -Scholar - -searches - -the - -Web - -for - -corresponding - -PDF - -files, - -even - -if - -a - -trusted - -publisher - -has - -already - -provided - -the - -full - -text. - -10 - -It - -makes - -no - -difference - -on - -which - -site - -the - -PDF - -is - -published; - -for - -instance, - -Google - -Scholar - -has - -indexed - -PDF - -files - -of - -our - -articles - -from - -the - -publisher’s - -site, - -our - -university’s - -site, - -our - -private - -home - -pages, - -and - -SciPlore.org. - -PDFs - -found - -on - -the - -Web - -are - -linked - -directly - -on - -Google - -Scholar’s - -results - -pages, - -in - -addition - -to - -the - -link - -to - -the - -publisher’s - -full - -text - -(see - -Figure - -4 - -for - -an - -illustrative - -example).
 - - - - - -Figure - -4: - -Linking - -database - -entries - -with - -external - -PDFs - - - - -If - -different - -PDF - -files - -of - -an - -article - -exist, - -Google - -Scholar - -groups - -them - -to - -improve - -the - -article’s - -ranking - -[35]. - -For - -instance, - -if - -a - -preprint - -version - -of - -an - -article - -is - -available - -on - -the - -author’s - -Web - -page - -and - -the - -final - -version - -is - -available - -on - -the - -publisher’s - -site, - -Google - -indexes - -both - -as - -one - -version. - -If - -the - -two - -versions - -contain - -different - -words, - -Google - -Scholar - -associates - -all - -contained - -words - -with - -the - -article. - -This - -is - -an - -interesting - -feature - -that - -we - -will - -discuss - -in - -more - -detail - -in - -the - -next - -section. - - - - -3. - -OPTIMIZING - -SCHOLARLY - -LITERATURE - -FOR - -GOOGLE - -SCHOLAR - -AND - -OTHER - -ACADEMIC - -SEARCH - -ENGINES - - - - -3.1 - -Preparation - - - - -In - -the - -beginning - -it - -is - -necessary - -to - -think - -about - -the - -most - -important - -words - -that - -are - -relevant - -to - -the - -article. - -It - -is - -not - -possible - -to - -optimize - -one - -document - -for - -dozens - -of - -keywords, - -so - -it - -is - -better - -to - -choose - -a - -few. - -There - -are - -tools - -that - -help - -in - -selecting - -the - -right - -keywords, - -such - -as - -Google - -Trends, - -Google - -Insights, - -Google - -Adwords - -keyword - -tool, - -Google - -Search-based - -keyword - -tool, - -and - -Spacky. - -11 - - - - -9 - - -Visit - -http://www.google.com/support/scholar/bin/request.py - -to - -ask - -Google - -Scholar - -to - -crawl - -your - -Web - -site - -containing - -scholarly - -articles.
- - - - -10 - - -Google - -Scholar - -also - -indexes - -other - -file - -types, - -such - -as - - - - -PostScript - -(*.ps), - -Microsoft - -Word - -(*.doc), - -and - -MS - -PowerPoint - - - -(*.ppt). - -Here - -we - -focus - -on - -PDF, - -which - -is - -the - -most - -common - - - -format - -for - -scientific - -articles. - - - - -11 - - -Google - -Trends - -http://www.google.com/trends - - - - - -Google - -Insights - -http://www.google.com/insights/search/ - - - - - - - -It - -might - -be - -wise - -not - -to - -select - -those - -keywords - -that - -are - -most - - - -popular. - -It - -is - -usually - -a - -good - -idea - -to - -query - -the - -common - -academic - - - -search - -engines - -using - -each - -proposed - -keyword; - -if - -the - -search - - - -already - -returns - -hundreds - -of - -documents, - -it - -may - -be - -better - -to - - - - -choose - -another - -keyword - -with - -less - -competition. - - -12 - - - - - - - -3.2 - -Writing - -Your - -Article - - - - -Once - -the - -keywords - -are - -chosen, - -they - -need - -to - -be - -mentioned - -in - -the - - - -right - -places: - -in - -the - -title, - -and - -as - -often - -as - -possible - -in - -the - -abstract - - - -and - -the - -body - -of - -the - -text - -(but, - -of - -course, - -not - -so - -often - -as - -to - -annoy - - - -readers). - -Although - -in - -general - -titles - -should - -be - -fairly - -short, - -we - - - -suggest - -choosing - -a - -longer - -title - -if - -there - -are - -many - -relevant - - - -keywords. - - - - -Synonyms - -of - -important - -keywords - -should - -also - -be - -mentioned - -a - -few - - - -times - -in - -the - -body - -of - -the - -text, - -so - -that - -the - -article - -may - -be - -found - -by - - - -someone - -who - -does - -not - -know - -the - -most - -common - -terminology - -used - - - -in - -the - -research - -field. 
 - -If - -possible, - -synonyms - -should - -also - -be - -mentioned - -in - -the - -abstract, - -particularly - -because - -some - -academic - -search - -engines - -do - -not - -index - -the - -document’s - -full - -text. - - - - -Be - -consistent - -in - -spelling - -people’s - -names, - -taking - -special - -care - -with - -names - -that - -contain - -special - -characters. - -If - -names - -are - -used - -inconsistently, - -search - -engines - -may - -not - -be - -able - -to - -identify - -articles - -or - -citations - -correctly; - -as - -a - -consequence, - -citations - -may - -be - -assigned - -incorrectly, - -and - -articles - -will - -not - -be - -as - -highly - -ranked - -as - -they - -could - -be. - -For - -instance, - -Jöran, - -Joeran, - -and - -Joran - -are - -all - -correct - -spellings - -of - -the - -same - -name - -(given - -different - -transcription - -rules), - -but - -Google - -Scholar - -sees - -them - -as - -three - -different - -names. - - - - -The - -article - -should - -use - -a - -common - -scientific - -layout - -and - -structure, - -including - -standard - -sections: - -introduction, - -related - -work, - -results, - -and - -so - -on. - -A - -common - -scientific - -layout - -and - -structure - -will - -help - -Web-based - -academic - -search - -engines - -to - -identify - -an - -article - -as - -scientific. - - - - -Academic - -search - -engines, - -and - -especially - -Google - -Scholar, - -assign - -significant - -weight - -to - -citation - -counts. - -Citations - -influence - -whether - -articles - -are - -indexed - -at - -all, - -and - -they - -also - -influence - -the - -ranking - -of - -articles. - -We - -do - -not - -want - -to - -encourage - -readers - -to - -build - -‘citation - -circles,’ - -or - -to - -take - -any - -other - -unethical - -action.
- - - -But - -any - -published - - - - -articles - -you - -have - -read - -that - -relate - -to - -your - -current - -research - -paper - - - -should - -be - -cited. - -When - -referencing - -your - -own - -published - -work, - -it - -is - - - -important - -to - -include - -a - -link - -where - -that - -work - -can - -be - -downloaded. - - - -This - -helps - -readers - -to - -find - -your - -article - -and - -helps - -academic - -search - - - -engines - -to - -index - -the - -referenced - -article’s - -full - -text. - -Of - -course, - -this - - - -can - -also - -be - -done - -for - -other - -articles - -that - -have - -well-known - -(i.e., - - - -stable - -and - -possibly - -canonical) - -download - -locations. - - - - -3.3 - -Preparing - -for - -Publication - - - - -Text - -in - -figures - -and - -tables - -should - -be - -machine - -readable - -(i.e., - - - -vector - -graphics - -containing - -font-based - -text - -should - -be - -used - -instead - - - - - - - - - - -Google - -Adwords - - - -https://adwords.google.com/select/KeywordToolExternal; - - - -Google - -keyword - -tool, - -http://google.com/sktool/ - - - - -Spacky, - -http://www.spacky.com - - - - -12 - - -For - -example, - -keywords - -such - -as - -‘Web’ - -and - -‘HTML’ - -may - -be - -of - - - - -limited - -use - -because - -there - -are - -too - -many - -papers - -published - -in - -that - - - -space, - -in - -which - -case - -it - -makes - -more - -sense - -to - -narrow - -the - -scope - - - -and - -choose - -better-differentiated - -keywords. - - - -
- -
-
- - - - - - - -en - - - - - - - - -
- - - -of - -rasterized - -images) - -so - -that - -it - -can - -easily - -be - -indexed - -by - -academic - -search - -engines. - -Vector - -graphics - -also - -look - -more - -professional, - -and - -are - -more - -user - -friendly, - -than - -raster/bitmap - -graphics. - -Graphics - -stored - -as - -JPEG, - -BMP, - -GIF, - -TIFF, - -or - -PNG - -files - -are - -not - -vector - -graphics. - - - - -When - -documents - -are - -converted - -to - -PDF, - -all - -metadata - -should - -be - -correct - -(especially - -author - -and - -title). - -Some - -search - -engines - -use - -PDF - -metadata - -to - -identify - -the - -file - -or - -to - -display - -information - -about - -the - -article - -on - -the - -search - -results - -page. - -It - -may - -also - -be - -beneficial - -to - -give - -a - -meaningful - -file - -name - -to - -each - -article. - - - - -3.4 - -Publishing - - - - -As - -part - -of - -the - -optimization - -process, - -authors - -should - -consider - -the - -journal’s - -or - -publisher’s - -policies. - -Open-access - -articles - -usually - -receive - -more - -citations - -than - -articles - -accessible - -only - -by - -purchase - -or - -subscription; - -and, - -obviously, - -only - -articles - -that - -are - -available - -on - -the - -Web - -can - -be - -indexed - -by - -Web-based - -academic - -search - -engines. - -Accordingly, - -when - -selecting - -a - -journal - -or - -publisher - -for - -submission, - -authors - -should - -favor - -those - -that - -cooperate - -with - -Google - -Scholar - -and - -other - -academic - -search - -engines, - -since - -the - -article - -will - -potentially - -obtain - -more - -readers - -and - -receive - -more - -citations.
 - - -13 - - -If - -a - -journal - -does - -not - -publish - -online, - -authors - -should - -favor - -publishers - -who - -at - -least - -allow - -authors - -to - -put - -their - -articles - -on - -their - -or - -their - -institutions’ - -home - -pages. - - - - -3.5 - -Follow-Up - - - - -There - -are - -three - -ways - -to - -optimize - -articles - -for - -academic - -search - -engines - -after - -publication. - - - - -The - -first - -is - -to - -publish - -the - -article - -on - -the - -author’s - -home - -page, - -so - -that - -Web-based - -academic - -search - -engines - -can - -find - -and - -index - -it - -even - -if - -the - -journal - -or - -publisher - -does - -not - -publish - -the - -article - -online. - -An - -author - -who - -does - -not - -have - -a - -Web - -page - -might - -post - -articles - -on - -an - -institutional - -Web - -page - -or - -upload - -them - -to - -a - -site - -such - -as - -SciPlore.org, - -which - -offers - -researchers - -a - -personal - -publications - -home - -page - -that - -is - -regularly - -crawled - -by - -Google - -Scholar - -(and, - -of - -course, - -by - -SciPlore - -Search). - -However, - -it - -is - -important - -to - -determine - -that - -posting - -or - -uploading - -the - -article - -does - -not - -constitute - -a - -violation - -of - -the - -author’s - -agreement - -with - -the - -publisher. - - - - -Second, - -an - -article - -that - -includes - -outdated - -words - -might - -be - -replaced - -by - -either - -updating - -the - -existing - -article - -or - -publishing - -a - -new - -version - -on - -the - -author’s - -home - -page. - -Google - -Scholar, - -at - -least, - -considers - -all - -versions - -of - -an - -article - -available - -on - -the - -Web. - -We - -consider - -this - -a - -good - -way - -of - -making - -older - -articles - -easier - -to - -find.
- - - -However, - -this - -practice - -may - -also - -violate - -your - -publisher’s - - - -copyright - -policy, - -and - -it - -may - -also - -be - -considered - -misbehavior - -by - - - -other - -researchers. - -It - -could - -also - -be - -a - -risky - -strategy: - -at - -some - -point - - - -in - -the - -future, - -search - -engines - -may - -come - -to - -classify - -this - -practice - -as - - - -spamming. - -In - -any - -case, - -updated - -articles - -should - -be - -clearly - -labeled - - - -as - -such, - -so - -that - -readers - -are - -aware - -that - -they - -are - -reading - -a - - - -modified - -version. - - - - -Third, - -it - -is - -important - -to - -create - -meaningful - -parent - -Web - -pages - -for - - - -PDF - -files. - -This - -means - -that - -Web - -pages - -that - -link - -to - -the - -PDF - -file - - - -should - -mention - -the - -most - -important - -keywords - -and - -the - -PDFs - - - - - - - - - - -13 - - -The - -main - -criteria - -for - -selecting - -a - -publisher - -or - -journal, - -of - - - - -course, - -should - -still - -be - -its - -reputation - -and - -its - -general - -suitability - - - -for - -the - -paper. - -The - -policy - -is - -to - -be - -seen - -as - -an - -additional - -factor. - - - - - - -metadata - -(title, - -author, - -and - -abstract). - -We - -do - -not - -know - -whether - - - -any - -academic - -search - -engines - -are - -considering - -these - -data - -yet, - -but - - - -normal - -search - -engines - -do - -consider - -them, - -and - -it - -seems - -only - -a - - - -matter - -of - -time - -before - -academic - -search - -engines - -do, - -too. - - - - -4. - -DISCUSSION - - - - -As - -was - -true - -in - -the - -beginning - -for - -classic - -SEO, - -there - -are - -some - - - -reservations - -about - -ASEO - -in - -the - -academic - -community. 
- -When - -we - - - - -submitted - -our - -study - -about - -Google - -Scholar’s - -ranking - -algorithm - - - - - - -[2] - -to - -a - -conference, - -it - -was - -rejected. - -One - -reviewer - -provided - -the - - - -following - -feedback: - - - - - -I’m - -not - -a - -big - -fan - -of - -this - -area - -of - -research - -[...]. - -I - -know - -it’s - -in - - - - -the - -call - -for - -papers, - -but - -I - - - -think - -that’s - -a - -mistake. - - - - - - - - -A - -second - -reviewer - -wrote, - - - - -[This] - -paper - -seems - -to - -encourage - -scientific - -paper - -authors - -to - - - - -learn - -Google - -scholar’s - -ranking - - - -metho - -d - - - -and - -write - -papers - - - - -accordingly - -to - -boost - -ranking - -[which - -is - -not] - -acceptable - -to - - - -scientific - -communities - -which - -are - -supposed - -to - -advocate - -true - - - -technical - -quality/impact - -instead - -of - -ranking. - - - - -ASEO - -should - -not - -be - -seen - -as - -a - -guide - -on - -how - -to - -cheat - -academic - - - -search - -engines. - -Rather, - -it - -is - -about - -helping - -academic - -search - - - -engines - -to - -understand - -the - -content - -of - -research - -papers - -and, - -thus, - - - -about - -how - -to - -make - -this - -content - -more - -widely - -and - -easily - -available. - - - -Certainly, - -we - -can - -anticipate - -that - -some - -researchers - -will - -try - -to - - - -boost - -their - -rankings - -in - -illegitimate - -ways. - -However, - -the - -same - - - -problem - -exists - -in - -regular - -Web - -searching; - -and - -eventually - -Web - - - -search - -engines - -manage - -to - -avoid - -spam - -with - -considerable - -success, - - - -and - -so - -will - -academic - -search - -engines. - -In - -the - -long - -term, - -ASEO - - - - -will - -be - -beneficial - -for - -all - - -– - - -authors, - -search - -engines, - -and - -users - -of - - - - -search - -engines. 
- -Therefore, - -we - -believe - -that - -academic - -search - - - -engine - -optimization - -(ASEO) - -should - -be - -a - -common - -procedure - -for - - - -researchers, - -similar - -to, - -for - -instance, - -selecting - -an - -appropriate - - - -journal - -for - -publication. - - - - -ACKNOWLEDGEMENTS - - - - -We - -thank - -the - -SEO - -Bert - -van - -Heerde - -from - -Insyde - - - -(http://www.insyde.nl/) - -for - -his - -valuable - -feedback, - -and - -Barbara - - - -Shahin - -for - -proofreading - -this - -article. - - - - -ABOUT - -THE - -AUTHORS - - - - -The - -research - -career - -of - -Jöran - -Beel - -and - -Bela - -Gipp - -began - -about - -ten - - - -years - -ago - -when - -they - -won - -second - -prize - -in - -Jugend - -Forscht, - - - -Germany’s - -largest - -and - -most - -reputable - -youth - -science - -competition - - - -and - -received - -awards - -from, - -among - -others, - -German - -Chancellor - - - -Gerhard - -Schröder - -for - -their - -outstanding - -research - -work. - -In - -2007, - - - -they - -graduated - -with - -distinction - -at - -OVGU - -Magdeburg, - -Germany, - - - -in - -the - -field - -of - -computer - -science. - -They - -now - -work - -for - -the - -VLBA- - - -Lab - -and - -are - -PhD - -students, - -currently - -at - -UC - -Berkeley - -as - -visiting - - - -student - -researchers. - -During - -the - -past - -years - -they - -have - -published - - - -several - -papers - -about - -academic - -search - -engines - -and - -research - -paper - - - -recommender - -systems. - - - - -Erik - -Wilde - -is - -Adjunct - -Professor - -at - -the - -UC - -Berkeley - -School - -of - - - -Information. - -He - -began - -his - -work - -in - -Web - -technologies - -and - -Web - - - -architectures - -a - -little - -over - -ten - -years - -ago - -by - -publishing - -the - -first - - - -book - -providing - -a - -complete - -overview - -of - -Web - -technologies.
- -After - - - -focusing - -for - -some - -years - -on - -XML - -technologies, - -XML - -and - - - -modelling, - -mapping - -issues - -between - -XML - -and - -non-tree - - - -metamodels, - -and - -XML-centric - -design - -of - -applications - -and - -data - - - -
- - - -models, - -he - -has - -recently - -shifted - -his - -main - -focus - -to - -information - -and - - - -application - -architecture, - -mobile - -applications, - -geo-location - -issues - - - -on - -the - -Web, - -and - -how - -to - -design - -data - -sharing - -that - -is - -open - -and - - - -accessible - -for - -many - -different - -service - -consumers. - - - - -REFERENCES - - - - - - -[1] - -Jöran - -Beel - -and - -Bela - -Gipp. - -Google - -Scholar’s - -Ranking - - - -Algorithm: - -The - -Impact - -of - -Citation - -Counts - -(An - -Empirical - -Study). - - - - -In - -André - -Flory - -and - -Martine - -Collard, - -editors, - - - -Proceedings - -of - -the - - - - -3rd - -IEEE - -International - -Conference - -on - -Research - -Challenges - -in - - - - -Information - -Science - -(RCIS’09) - -, - - - -pages - - - -439 - -– - -446, - - - -Fez - -(Morocco), - - - - -April - -2009. - -IEEE. - -doi: - -10.1109/RCIS.2009.5089308. - -ISBN - -978-1-4244-2865-6. - -Available - -on - -http://www.sciplore.org. - - - - - - -[2] - -Jöran - -Beel - -and - -Bela - -Gipp. - -Google - -Scholar’s - -Ranking - - - -Algorithm: - -An - -Introductory - -Overview. - -In - -Birger - -Larsen - -and - - - - -Jacqueline - -Leta, - -editors, - - - -Proceedings - -of - -the - -12th - -International - - - - - -Conference - -on - -Scientometrics - -and - -Informetrics - -(ISSI’09) - -, - - - - - - - -volume - -1, - -pages - - - -230 - -– - -241, - - - -Rio - -de - -Janeiro - -(Brazil), - -July - -2009. - - - - -International - -Society - -for - -Scientometrics - -and - -Informetrics. - -ISSN - - - -2175-1935. - -Available - -on - -http://www.sciplore.org. - - - - - - -[3] - -Jöran - -Beel - -and - -Bela - -Gipp. - -Google - -Scholar’s - -Ranking - - - -Algorithm: - -The - -Impact - -of - -Articles’ - -Age - -(An - -Empirical - -Study).
- -In - - - - -Shahram - -Latifi, - -editor, - - - -Proceedings - -of - -the - -6th - -International - - - - -Conference - -on - -Information - -Technology: - -New - -Generations - - - - -(ITNG’09) - -, - - - -pages - - - -160 - -– - -164, - - - -Las - -Vegas - -(USA), - -April - -2009. - -IEEE. - - - - -doi: - -10.1109/ITNG.2009.317. - -ISBN - -978-1424437702. - -Available - - - -on - -http://www.sciplore.org. - - - - - - - -[4] - -Google. - -Google’s - - - -Search - - - -Engine - -Optimization - -Starter - -Guide. - - - - -PDF, - -November - -2008. - -URL - -http://www.google.com/webmasters/docs/search-engine-optimization-starter-guide.pdf. - - - - - - -[5] - -Albert - -Bifet - -and - -Carlos - -Castillo. - -An - -Analysis - -of - -Factors - -Used - - - - -in - -Search - -Engine - -Ranking. - -In - - - -Proceedings - -of - -the - -14th - - - - -International - -World - -Wide - -Web - -Conference - -(WWW2005), - -First - - - -International - -Workshop - -on - -Adversarial - -Information - -Retrieval - -on - - - - -the - -Web - -(AIRWeb’05) - -, - - - -2005. - - - - -http://airweb.cse.lehigh.edu/2005/bifet.pdf. - - - - - - -[6] - -Michael - -P. - -Evans. - -Analysing - -Google - -rankings - -through - -search - - - - -engine - -optimization - -data. - - - -Internet - - - -Research - -, - - - -17 - -(1): - - - -21 - -– - -37, - - - -2007. - - - - -doi: - -10.1108/10662240710730470. - - - - - - -[7] - -Jin - -Zhang - -and - -Alexandra - -Dimitroff. - -The - -impact - -of - -metadata - - - -implementation - -on - -webpage - -visibility - -in - -search - -engine - -results - - - - -(Part - -II). - - - -Cross-Language - -Information - - - -Retrieval - -, - - - -41 - -(3): - - - -691 - -– - - - - - - -715, - -May - -2005. - - - - - - - -[8] - -Harold - -Davis. - - - -Search - -Engine - - - -Optimization - -. - - - -O’Reilly, - -2006. - - - - - - - - - - - -[9] - -Jennifer - -Grappone - -and - -Gradiva - -Couzin.
- - - -Search - -Engine - - - - - -Optimization: - -An - -Hour - -a - - - -Day - -. - - - -John - -Wiley - -and - -Sons, - -2nd - -edition, - - - - -2008. - - - - - - - -[10] - -Peter - -Kent. - - - -Search - -engine - -optimization - -for - - - -dummies - -. - - - -Wiley - - - - -Publishing - -Inc, - -2006. - - - - - - -[11] - -AA - -Benczur, - -K - -Csalogány, - -T - -Sarlós, - -and - -M - -Uher. - - - - -SpamRank - - -– - - -Fully - -Automatic - -Link - -Spam - -Detection. - -In - - - - - -Adversarial - -Information - -Retrieval - -on - -the - -Web - -(AIRWeb’05) - -, - - - - - - -2005. - - - - - - -[12] - -A. - -Benczúr, - -K. - -Csalogány, - -and - -T. - -Sarlós. - -Link-based - - - - -similarity - -search - -to - -fight - -web - -spam. - - - -Adversarial - -Information - - - - - - - - -Retrieval - -on - -the - -Web - -(AIRWeb), - -Seattle, - -Washington, - - - -USA - -, - - - - - - -2006. - - - - -[13] - -I. - -Drost - -and - -T. - -Scheffer. - -Thwarting - -the - -nigritude - - - - -ultramarine: - -Learning - -to - -identify - -link - -spam. - - - -Lecture - -Notes - -in - - - - - -Computer - - - -Science - -, - - - -3720: - -96, - -2005. - - - - - - - -[14] - -D. - -Fetterly, - -M. - -Manasse, - -and - -M. - -Najork. - -Spam, - -damn - -spam, - - - -and - -statistics: - -Using - -statistical - -analysis - -to - -locate - -spam - -web - - - - -pages. - -pages - - - -1 - -– - -6, - - - -2004. - - - - - - - -[15] - -Q. - -Gan - -and - -T. - -Suel. - -Improving - -web - -spam - -classifiers - -using - - - - -link - -structure. - -In - - - -Proceedings - -of - -the - -3rd - -international - -workshop - - - - - -on - -Adversarial - -information - -retrieval - -on - -the - - - -web - -, - - - -page - -20. - -ACM, - - - - -2007. - - - - - - -[16] - -Z. - -Gyöngyi - -and - -H. - -Garcia-Molina. - -Link - -spam - -alliances.
- -In - - - -Proceedings - -of - -the - -31st - -international - -conference - -on - -Very - -large - - - - -data - - - -bases - -, - - - -page - -528. - -VLDB - -Endowment, - -2005. - - - - - -[17] - -H. - -Saito, - -M. - -Toyoda, - -M. - -Kitsuregawa, - -and - -K. - -Aihara. - -A - - - -large-scale - -study - -of - -link - -spam - -detection - -by - -graph - -algorithms. - -In - - - -Proceedings - -of - -the - -3rd - -international - -workshop - -on - -Adversarial - - - - -information - -retrieval - -on - -the - - - -web - -, - - - -page - -48. - -ACM, - -2007. - - - - - -[ - -18] - -B. - -Wu - -and - -K. - -Chel - -lapilla. - -Extracting - -link - -spam - -using - -biased - - - - -random - -walks - -from - -spam - -seed - -sets. - -In - - - -Proceedings - -of - -the - -3rd - - - - -international - -workshop - -on - -Adversarial - -information - -retrieval - -on - - - - -the - - - -web - -, - - - -page - -44. - -ACM, - -2007. - - - - - - - -[19] - -C. - -Castillo, - -D. - -Donato, - -A. - -Gionis, - -V. - -Murdock, - -and - - - -F. - -Silvestri. - -Know - -your - -neighbors: - -Web - -spam - -detection - -using - -the - - - - -web - -topology. - -In - - - -Proceedings - -of - -the - -30th - -annual - -international - - - - -ACM - -SIGIR - -conference - -on - -Research - -and - -development - -in - - - - -information - - - -retrieval - -, - - - -page - -430. - -ACM, - -2007. - - - - - - - -[20] - -G.G. - -Geng, - -C.H. - -Wang, - -and - -Q.D. - -Li. - -Improving - - - -Spamdexing - -Detection - -Via - -a - -Two-Stage - -Classification - -Strategy. - - - -page - -356, - -2008. - - - - - - -[21] - -I.S. - -Nathenson. - -Internet - -infoglut - -and - -invisible - -ink: - - - - -Spamdexing - -search - -engines - -with - -meta - -tags. - - - -Harv. - -J. - -Law - -& - - - -Tec - -, - - - - - - - -12: - - - -43 - -– - -683, - - - -1998. - - - - - - - -[22] - -T. - -Urvoy, - -E. - -Chauveau, - -P. - -Filoche, - -and - -T. - -Lavergne. 
- - - - -Tracking - -web - -spam - -with - -HTM - -L - -style - -similarities. - - - -ACM - - - - - -Transactions - -on - -the - -Web - - - -(TWEB) - -, - - - -2, - -2008. - - - - - - - -[23] - -T. - -Urvoy, - -T. - -Lavergne, - -and - -P. - -Filoche. - -Tracking - -web - -spam - - - - -with - -hidden - -style - -similarity. - -In - - - -AIRWeb - - - -2006 - -, - - - -page - -25, - -2006. - - - - - - - -[24] - -Masahiro - -Kimura, - -Kazumi - -Saito, - -Kazuhiro - -Kazama, - -and - - - -Shin - -ya - -Sato. - -Detecting - -Search - -Engine - -Spam - -from - -a - -Trackback - - - - -Network - -in - -Blogspace. - - - -Lecture - -Notes - -in - -Computer - -Science: - - - - -Knowledge-Based - -Intelligent - -Information - -and - -Engineering - - - - -Systems - -, - - - -3684: - - - -723 - -– - -729, - - - -2005. - -doi: - -10.1007/11554028_101. - - - - - - - -[25] - -Alexandros - -Ntoulas, - -Marc - -Najork, - -Mark - -Manasse, - -and - - - -Dennis - -Fetterly. - -Detecting - -spam - -web - -pages - -through - -content - - - - -analysis. - -In - - - -15th - -International - -Conference - -on - -World - -Wide - - - -Web - -, - - - - - - - -pages - - - -83 - -– - -92. - - - -ACM, - -2006. - - - - - - - -[26] - -Baoning - -Wu - -and - -Brian - -D. - -Davison. - -Identifying - -link - -farm - - - - -spam - -pages. - -In - - - -14th - -International - -Conference - -on - -World - -Wide - - - - - -Web - -, - - - -pages - - - -820 - -– - -829, - - - -2005. - - - - -
- - - - - -[27] - -Yahoo! - -How - -do - -I - -improve - -the - -ranking - -of - -my - -web - -site - -in - -the - - - -search - -results?, - -July - -2007. - -URL - -http://help.yahoo.com/l/us/- - - - -yahoo/search/indexing/ranking-02.html. - - - - - - -[28] - -Alex - -Chitu. - -Google’s - -Market - -Share - -in - -Your - -Country. - - - -Website, - -March - -2009. - -URL - -http://googlesystem.blogspot.com/- - - - -2009/03/googles-market-share-in-your-country.html - -https://- - - - -spreadsheets.google.com/- - - - -ccc?key=pLaE9tsVLp_0y1 - -FKWBCKGBA. - - - - - - -[29] - -D. - -Lewandowski - -and - -P. - -Mayr. - -Exploring - -the - -academic - - - - -invisible - -web. - - - -Library - -Hi - - - -Tech - -, - - - -24 - -(4) - -: - - - -529 - -– - -539, - - - -2006. - - - - - - - -[30] - -Nisa - -Bakkalbasi, - -Kathleen - -Bauer, - -Janis - -Glover, - -and - -Lei - - - -Wang. - -Three - -options - -for - -citation - -tracking: - -Google - -Scholar, - - - - -Scopus - -and - -Web - -of - -Science. - - - -Biomedical - -Digital - - - -Libraries - -, - - - -3, - - - - -2006. - -doi - -: - -10.1186/1742-5581-3-7. - - - - - -[31] - -John - -J. - -Meier - -and - -Thomas - - - -W. - -Conkling. - -Google - -Scholar’s - - - - - -Coverage - -of - -the - -Engineering - -Literature: - -An - -Empirical - -Study. - - - -The - - - - - -Journal - -of - -Academic - - - -Librarianship - -, - - - -34 - -(34): - - - -196 - -– - -201, - - - -2008. - - - - - - - - - -[32] - -William - -H. - -Walters. - -Google - -Scholar - -coverage - -of - -a - - - - -multidisciplinary - -field. - - - -Information - -Processing - -& - - - -Management - -, - - - - - - - -43 - -(4) - -: - - - -1121 - -– - -1132, - - - -July - -2007. - -doi - -: - - - - -doi - -:10.1016/j - -. - -ipm.2006.08.006. - - - - - - -[33] - -Google. - -About - -Google - -Scholar. - -Website, - -2008. - -URL - -http://- - - - -scholar.google.com/intl/en/scholar/about.html. - - - - - - -[34] - -Bert - -van - -Heerde. 
- -RE: - -Pre-print: - -Academic - -Search - -Engine - - - -Optimization. - -Email, - -3 - -September - -2009. - - - - - - -[35] - -Google - -Scholar. - -Support - -for - -Scholarly - -Publishers. - -Website, - - - -2009. - -URL - -http://scholar.google.com/intl/en/scholar/- - - - -publishers.html. - - - - -[36] - -S. - -Robertson, - -H. - -Zaragoza, - -and - -M. - -Taylor. - -Simple - -BM25 - - - - -extension - -to - -multiple - -weighted - -fields. - -In - - - -Proceedings - -of - -the - - - - -thirteenth - -ACM - -international - -conference - -on - -Information - -and - - - - -knowledge - - - -management - -, - - - -pages - - - -42 - -– - -49. - - - -ACM - -New - -York, - -NY, - - - - -USA, - -2004. - - - -
- -
-
diff --git a/bin/2010-ASEO--preprint.txt b/bin/2010-ASEO--preprint.txt deleted file mode 100644 index d03b0fb..0000000 --- a/bin/2010-ASEO--preprint.txt +++ /dev/null @@ -1,610 +0,0 @@ -Preprint of: Jöran Beel, Bela Gipp, and Erik Wilde. Academic Search Engine Optimization (ASEO): Optimizing Scholarly Literature for Google Scholar and -Co. Journal of Scholarly Publishing, 41 (2): 176–190, January 2010. doi: 10.3138/jsp.41.2.176. University of Toronto Press. Downloaded from -http://www.sciplore.org -Academic Search Engine Optimization (ASEO): Optimizing -Scholarly Literature for Google Scholar & Co. -Jöran Beel -Otto-von-Guericke University -FIN / ITI / VLBA-Lab -Germany -beel@sciplore.org -Bela Gipp -Otto-von-Guericke University -FIN / ITI / VLBA-Lab -Germany -gipp@sciplore.org -Erik Wilde -UC Berkeley -School of Information -United States -dret@berkeley.edu -ABSTRACT -This article introduces and discusses the concept of academic -search engine optimization (ASEO). Based on three recently -conducted studies, guidelines are provided on how to optimize -scholarly literature for academic search engines in general and -for Google Scholar in particular. In addition, we briefly discuss -the risk of researchers’ illegitimately ‘over-optimizing’ their -articles. -Keywords -academic search engines, academic search engine optimization, -ASEO, Google Scholar, ranking algorithm, search engine -optimization, SEO -1. INTRODUCTION -Researchers should have an interest in ensuring that their articles -are indexed by academic search engines1 such as Google Scholar, -IEEE Xplore, PubMed, and SciPlore.org, which greatly improves -their ability to make their articles available to the academic -community. Not only should authors take an interest in seeing -that their articles are indexed, they also should be interested in -where the articles are displayed in the results list. Like any other -type of ranked search results, articles displayed in top positions -are more likely to be read.
-This article presents the concept of academic search engine -optimization (ASEO) to optimize scholarly literature for -academic search engines. The first part of the article covers -related work that has been done mostly in the field of general -search engine optimization for Web pages. The second part -defines ASEO and compares it to search engine optimization for -Web pages. The third part provides an overview of ranking -algorithms of academic search engines in general, followed by an -overview of Google Scholar’s ranking algorithm. Finally, -guidelines are provided on how authors can optimize their -articles for academic search engines. This article does not cover -how publishers or providers of academic repositories can -optimize their Web sites and repositories for academic search -engines. The guidelines are based on three studies we have -recently conducted [1-3] and on our experience in developing the -academic search engine SciPlore.org. -1 In this article we do not distinguish between ‘academic -databases’ and ‘academic search engines’; the latter term is -used as a synonym for both. -2. RELATED WORK -On the Web, search engine optimization (SEO) for Web sites is a -common procedure. SEO involves creating or modifying a Web -site in a way that makes it ‘easier for search engines to both -crawl and index [its] content’ [4]. There exists a huge community -that discusses the latest trends in SEO and provides advice for -Webmasters in forums, blogs, and newsgroups.2 Even research -articles and books exist on the subject of SEO [5-10]. When SEO -began, many expressed their concerns that it would promote -spam and tweaking, and, indeed, search-engine spam is a serious -issue [11-26]. Today, however, SEO is a common and widely -accepted procedure and overall, search engines manage to -identify spam quite well.
Probably the strongest argument for -SEO is the fact that search engines themselves publish guidelines -on how to optimize Web sites for search engines [4, 27]. But -similar information on optimizing scholarly literature for -academic search engines does not exist, to our knowledge.3 -2.1 Introduction to Academic Search Engine -Optimization (ASEO) -Based on the definition of search engine optimization for Web -pages (SEO), we define academic search engine optimization -(ASEO) as follows: -Academic search engine optimization (ASEO) is the creation, -publication, and modification of scholarly literature in a -way that makes it easier for academic search engines to both -crawl it and index it. -ASEO differs from SEO in four significant respects. First, for -Web search, Google is the market leader in most (Western) -countries [28]. This means that for Webmasters (focusing on -Western Internet users), it is generally sufficient to optimize their -Web sites for Google. In contrast, no such market leader exists -2 E.g. http://www.abakus-internet-marketing.de/foren -http://www.highrankings.com/forum -http://www.seo-guy.com/forum -http://www.seomoz.org/blog -http://www.seo.com/blog -http://www.abakus-internet-marketing.de/seoblog -3 Google Scholar offers some information for publishers on how -to get their articles indexed by Google Scholar and ranked well -[35]. However, this information is superficial in comparison to -other SEO articles, and the information is not aimed at authors. -for searching academic articles, and researchers would need to -optimize their articles for several academic search engines. If -these search engines are based on different crawling and ranking -methods, optimization can become complicated. -Second, Webmasters usually do not need to worry about whether -their site is indexed by a search engine: as long as any Web page -is linked to an already indexed page, it will be crawled and -indexed by Web search engines at some point. 
The situation is -different in academia, where only a fraction of all published -material is available on the Web and accessible to Web-based -academic search engines such as CiteSeer. Most academic -articles are stored in publishers’ databases; they are part of the -‘academic invisible web,’ [29] and (academic) search engines -usually cannot access and index these articles. A few academic -search engines, such as Scirus and Google Scholar, cooperate -with publishers, but still they do not cover all existing articles -[30-32]. Researchers therefore need to think seriously about how -to get their articles indexed by academic search engines. -Third, Webmasters can alter their pages by adding or replacing -words and links, deleting pages, offering multiple versions with -slight variations, and so on; in this way they can test new -methods and adapt to changes in ranking algorithms. Scholarly -authors can hardly do so: once an article is published, it is -difficult and sometimes impossible to alter it. Therefore, ASEO -needs to be performed particularly carefully. -Finally, Web search engines usually index all text on a Web site, -or at least the majority of it. In contrast, some academic search -engines do not index a document’s full text but instead index -only the title and abstract. This means that for some academic -search engines authors need to focus on the article’s title and -abstract, but in other cases they still have to consider the full text -for other search engines. -2.2 An Overview of Academic Search -Engine Ranking Algorithms -The basic concept of keyword-based searching is the same for all -major (academic) search engines. Users search for a search term -in a certain document field (e.g., title, abstract, body text), or in -all fields, and all documents containing the search term are listed -on the results page. Academic search engines use different -ranking algorithms to determine in which position the results are -displayed.
Some let the user choose one factor on which to rank -the results (common ranking factors are publication date, citation -count, author or journal name and reputation, and relevance of -the document); others combine the ranking factors into one -algorithm, and, more often than not, the user has no influence on -the factor’s weighting. -The relevance of a document is basically a function of how often -the search term occurs in that document and in which part of the -document it occurs. Generally speaking, the more often a search -term occurs in the document, and the more important the -document field is in which the term occurs, the more relevant the -document is considered4. This means that an occurrence in the -4 Some algorithms, such as the BM25(f ), saturate when a word -occurs often in the text [36]. -title is weighted more heavily than an occurrence in the abstract, -which carries more weight than an occurrence in a (sub)heading, -than in the body text, and so on. Possible document fields that -may be weighted differently by academic search engines are:5 -• Title -• Author names -• Abstract -• (Sub)headings -• Author keywords -• Body text -• Tables and figures -• Publication name (name of journal, conference, -proceedings, book, etc.) -• User keywords (Social tags) -• Social annotations -• Description -• Filename -• URI -The metadata of electronic files are especially important for -academic search engines crawling the Web. When a search -engine finds a PDF on the Web, it does not know whether this -PDF represents an academic article, or which one it belongs to; -therefore, the PDF must be identified, and one way to do this is -by extracting the author and title. This can be done by analyzing -the full text of the document or the metadata of the PDF. -It is also important to note that text in figures and tables usually -is indexed only if it is embedded as real text or within a vector -graphic. 
If text is embedded as a raster graphic (e.g., *.bmp, -*.png, *.gif, *.tif, *.jpg), most, if not all, search engines will not -index the text (see Figures 1 and 2 for an illustration of -differences between vector and raster/bitmap graphics).6 To our -knowledge, none of the major academic search engines currently -considers synonyms. This means that a document containing only -the term ‘academic search engine’ would not be found via a -search for ‘scientific paper search engine’ or ‘academic -database.’ What most academic search engines do is stemming: -words are reduced to their stems (e.g., ‘analysed’ and ‘analysing’ -would be reduced to ‘analyse’). -2.3 Google Scholar’s Ranking Algorithm -Google Scholar is one of those search engines that combine -several factors into one ranking algorithm. The most important -factors are relevance, citation count, author name(s), and name of -publication.7 -5 Some of the data could be retrieved from the document full -text, others from the metadata (of electronic files). -6 Theoretically search engines could index the text in -raster/bitmap graphics, but they would have to apply optical -character recognition (OCR). To our knowledge, no search -engine currently does this, although some are using OCR to -index complete scans of scholarly literature. -7 Google Scholar offers different search functions. For instance, it -is possible to search for ‘related articles’ and ‘recent articles.’ -In this article we focus on the normal ranking algorithm, which -is applied for the standard keyword search. -2.3.1 Relevance -Google Scholar focuses strongly on document titles. Documents -containing the search term in the title are likely to be positioned -near the top of the results list.
Google Scholar also seems to -consider the length of a title: In a search for the term ‘SEO,’ a -document titled ‘SEO: An Overview’ would be ranked higher -than one titled ‘Search Engine Optimization (SEO): A Literature -Survey of the Current State of the Art.’ -Although Google Scholar indexes entire documents, the total -search term count in the document has little or no impact. In a -search for ‘recommender systems,’ a document containing fifty -instances of this term would not necessarily be ranked higher -than a document containing only ten instances. -Figure 1: Example of a Vector Graphic -Like other search engines, Google Scholar does not index text in -figures and tables inserted as raster/bitmap graphics, but it does -index text in vector graphics. It is also known that neither -synonyms nor PDF metadata are considered. -2.3.2 Citation Counts -Citation counts play a major role in Google Scholar’s ranking -algorithm, as illustrated in Figure 3, which shows the mean -citation count for each position in Google Scholar.8 It is clear -that, on average, articles in the top positions have significantly -more citations than articles in the lowest positions. This means -that to achieve a good ranking in Google Scholar, many citations -are essential. Google Scholar seems not to differentiate between -self-citations and citations by third parties. -8 On average, articles at position 1 had 834 citations, articles at -position 2 had 552, articles at position 3 had 426, and articles -at position 1000 had fifty-three. The study was based on -1,032,766 results produced by 1050 search queries in -November 2008. For more detail see [1]. -Figure 2: Example of a Bitmap Graphic -2.3.3 Author and Publication Name -If the search query includes an author or publication name, a -document in which either appears is likely to be ranked high. 
For -instance, seventy-four of the top 100 results of a search for -‘arteriosclerosis and thrombosis cure’ were articles about various -(medical) topics from the journal Arteriosclerosis, Thrombosis, -and Vascular Biology, many of which did not include the search -term either in the title or in the full text [2]. -Figure 3: Mean Citation Count per Position8 -2.3.4 Other factors -Google Scholar’s standard search does not consider publication -dates. However, Google Scholar offers a special search function -for ‘recent articles,’ which limits results to articles published -within the past five years. Furthermore, Google Scholar claims to -consider both publication and author reputation [33]. However, -we could not research the influence of these factors because of a -lack of data, and therefore we do not consider them here. -2.3.5 Sources Indexed by Google Scholar -Bert van Heerde, a professional in the field of SEO, uses the -term ‘invitation based search engine’ to describe Google Scholar: -Only articles from trusted sources and articles that are ‘invited’ -(cited) by articles already indexed are included in the database -[34]. ‘Trusted sources,’ in this case, are publishers that cooperate -directly with Google Scholar, as well as publishers and -Webmasters who have requested that Google Scholar crawl their -databases and Web sites.9 -Once an article is included in Google Scholar’s database, Google -Scholar searches the Web for corresponding PDF files, even if a -trusted publisher has already provided the full text. 10 It makes no -difference on which site the PDF is published; for instance, -Google Scholar has indexed PDF files of our articles from the -publisher’s site, our university’s site, our private home pages, -and SciPlore.org. PDFs found on the Web are linked directly on -Google Scholar’s results pages, in addition to the link to the -publisher’s full text (see Figure 4 for an illustrative example).
-Figure 4: Linking database entries with external PDFs -If different PDF files of an article exist, Google Scholar groups -them to improve the article’s ranking [35]. For instance, if a -preprint version of an article is available on the author’s Web -page and the final version is available on the publisher’s site, -Google indexes both as one version. If the two versions contain -different words, Google Scholar associates all contained words -with the article. This is an interesting feature that we will -discuss in more detail in the next section. -3. OPTIMIZING SCHOLARLY -LITERATURE FOR GOOGLE SCHOLAR -AND OTHER ACADEMIC SEARCH -ENGINES -3.1 Preparation -In the beginning it is necessary to think about the most important -words that are relevant to the article. It is not possible to -optimize one document for dozens of keywords, so it is better to -choose a few. There are tools that help in selecting the right -keywords, such as Google Trends, Google Insights, Google -Adwords keyword tool, Google Search–based keyword tool, and -Spacky.11 -9 Visit http://www.google.com/support/scholar/bin/request.py to -ask Google Scholar to crawl your Web site containing scholarly -articles. -10 Google Scholar also indexes other file types, such as -PostScript (*.ps), Microsoft Word (*.doc), and MS PowerPoint -(*.ppt). Here we focus on PDF, which is the most common -format for scientific articles. -11 Google Trends http://www.google.com/trends -Google Insights http://www.google.com/insights/search/ -It might be wise not to select those keywords that are most -popular. It is usually a good idea to query the common academic -search engines using each proposed keyword; if the search -already returns hundreds of documents, it may be better to -choose another keyword with less competition. 
12 -3.2 Writing Your Article -Once the keywords are chosen, they need to be mentioned in the -right places: in the title, and as often as possible in the abstract -and the body of the text (but, of course, not so often as to annoy -readers). Although in general titles should be fairly short, we -suggest choosing a longer title if there are many relevant -keywords. -Synonyms of important keywords should also be mentioned a few -times in the body of the text, so that the article may be found by -someone who does not know the most common terminology used -in the research field. If possible, synonyms should also be -mentioned in the abstract, particularly because some academic -search engines do not index the document’s full text. -Be consistent in spelling people’s names, taking special care -with names that contain special characters. If names are used -inconsistently, search engines may not be able to identify articles -or citations correctly; as a consequence, citations may be -assigned incorrectly, and articles will not be as highly ranked as -they could be. For instance, Jöran, Joeran, and Joran are all -correct spellings of the same name (given different transcription -rules), but Google Scholar sees them as three different names. -The article should use a common scientific layout and structure, -including standard sections: introduction, related work, results, -and so on. A common scientific layout and structure will help -Web-based academic search engines to identify an article as -scientific. -Academic search engines, and especially Google Scholar, assign -significant weight to citation counts. Citations influence whether -articles are indexed at all, and they also influence the ranking of -articles. We do not want to encourage readers to build ‘citation -circles,’ or to take any other unethical action. But any published -articles you have read that relate to your current research paper -should be cited.
When referencing your own published work, it is -important to include a link where that work can be downloaded. -This helps readers to find your article and helps academic search -engines to index the referenced article’s full text. Of course, this -can also be done for other articles that have well-known (i.e., -stable and possibly canonical) download locations. -3.3 Preparing for Publication -Text in figures and tables should be machine readable (i.e., -vector graphics containing font-based text should be used instead -Google Adwords -https://adwords.google.com/select/KeywordToolExternal; -Google keyword tool, http://google.com/sktool/ -Spacky, http://www.spacky.com -12 For example, keywords such as ‘Web’ and ‘HTML’ may be of -limited use because there are too many papers published in that -space, in which case it makes more sense to narrow the scope -and choose better-differentiated keywords. -of rasterized images) so that it can easily be indexed by academic -search engines. Vector graphics also look more professional, and -are more user friendly, than raster/bitmap graphics. Graphics -stored as JPEG, BMP, GIF, TIFF, or PNG files are not vector -graphics. -When documents are converted to PDF, all metadata should be -correct (especially author and title). Some search engines use -PDF metadata to identify the file or to display information about -the article on the search results page. It may also be beneficial to -give a meaningful file name to each article. -3.4 Publishing -As part of the optimization process, authors should consider the -journal’s or publisher’s policies. Open-access articles usually -receive more citations than articles accessible only by purchase -or subscription; and, obviously, only articles that are available on -the Web can be indexed by Web-based academic search engines. 
-Accordingly, when selecting a journal or publisher for -submission, authors should favor those that cooperate with -Google Scholar and other academic search engines, since the -article will potentially obtain more readers and receive more -citations. 13 If a journal does not publish online, authors should -favor publishers who at least allow authors to put their articles -on their or their institutions’ home pages. -3.5 Follow-Up -There are three ways to optimize articles for academic search -engines after publication. -The first is to publish the article on the author’s home page, so -that Web-based academic search engines can find and index it -even if the journal or publisher does not publish the article -online. An author who does not have a Web page might post -articles on an institutional Web page or upload it to a site such as -Sciplore.org, which offers researchers a personal publications -home page that is regularly crawled by Google Scholar (and, of -course, by SciPlore Search). However, it is important to -determine that posting or uploading the article does not -constitute a violation of the author’s agreement with the -publisher. -Second, an article that includes outdated words might be -replaced by either updating the existing article or publishing a -new version on the author’s home page. Google Scholar, at least, -considers all versions of an article available on the Web. We -consider this a good way of making older articles easier to find. -However, this practice may also violate your publisher’s -copyright policy, and it may also be considered misbehavior by -other researchers. It could also be a risky strategy: at some point -in the future, search engines may come to classify this practice as -spamming. In any case, updated articles should be clearly labeled -as such, so that readers are aware that they are reading a -modified version. -Third, it is important to create meaningful parent Web pages for -PDF files. 
This means that Web pages that link to the PDF file -should mention the most important keywords and the PDFs -13 The main criteria for selecting a publisher or journal, of -course, should still be its reputation and its general suitability -for the paper. The policy is to be seen as an additional factor. -metadata (title, author, and abstract). We do not know whether -any academic search engines are considering these data yet, but -normal search engines do consider them, and it seems only a -matter of time before academic search engines do, too. -4. DISCUSSION -As was true in the beginning for classic SEO, there are some -reservations about ASEO in the academic community. When we -submitted our study about Google Scholar’s ranking algorithm -[2] to a conference, it was rejected. One reviewer provided the -following feedback: -I’m not a big fan of this area of research [...]. I know it’s in -the call for papers, but I think that’s a mistake. -A second reviewer wrote, -[This] paper seems to encourage scientific paper authors to -learn Google scholar’s ranking method and write papers -accordingly to boost ranking [which is not] acceptable to -scientific communities which are supposed to advocate true -technical quality/impact instead of ranking. -ASEO should not be seen as a guide on how to cheat academic -search engines. Rather, it is about helping academic search -engines to understand the content of research papers and, thus, -about how to make this content more widely and easily available. -Certainly, we can anticipate that some researchers will try to -boost their rankings in illegitimate ways. However, the same -problem exists in regular Web searching; and eventually Web -search engines manage to avoid spam with considerable success, -and so will academic search engines. In the long term, ASEO -will be beneficial for all – authors, search engines, and users of -search engines. 
Therefore, we believe that academic search -engine optimization (ASEO) should be a common procedure for -researchers, similar to, for instance, selecting an appropriate -journal for publication. -ACKNOWLEDGEMENTS -We thank the SEO Bert van Heerde from Insyde -(http://www.insyde.nl/) for his valuable feedback, and Barbara -Shahin for proofreading this article. -ABOUT THE AUTHORS -The research career of Jöran Beel and Bela Gipp began about ten -years ago when they won second prize in Jugend Forscht, -Germany’s largest and most reputable youth science competition -and received awards from, among others, German Chancellor -Gerhard Schröder for their outstanding research work. In 2007, -they graduated with distinction at OVGU Magdeburg, Germany, -in the field of computer science. They now work for the VLBA- -Lab and are PhD students, currently at UC Berkeley as visiting -student researchers. During the past years they have published -several papers about academic search engines and research paper -recommender systems. -Erik Wilde is Adjunct Professor at the UC Berkeley School of -Information. He began his work in Web technologies and Web -architectures a little over ten years ago by publishing the first -book providing a complete overview of Web technologies. After -focusing for some years on XML technologies, XML and -modelling, mapping issues between XML and non-tree -metamodels, and XML-centric design of applications and data -models, he has recently shifted his main focus to information and -application architecture, mobile applications, geo-location issues -on the Web, and how to design data sharing that is open and -accessible for many different service consumers. -REFERENCES -[1] Jöran Beel and Bela Gipp. Google Scholar’s Ranking -Algorithm: The Impact of Citation Counts (An Empirical Study).
-In André Flory and Martine Collard, editors, Proceedings of the -3rd IEEE International Conference on Research Challenges in -Information Science (RCIS’09), pages 439–446, Fez (Morocco), -April 2009. IEEE. doi: 10.1109/RCIS.2009.5089308. ISBN 978- -1-4244-2865-6. Available on http://www.sciplore.org. -[2] Jöran Beel and Bela Gipp. Google Scholar’s Ranking -Algorithm: An Introductory Overview. In Birger Larsen and -Jacqueline Leta, editors, Proceedings of the 12th International -Conference on Scientometrics and Informetrics (ISSI’09), -volume 1, pages 230–241, Rio de Janeiro (Brazil), July 2009. -International Society for Scientometrics and Informetrics. ISSN -2175-1935. Available on http://www.sciplore.org. -[3] Jöran Beel and Bela Gipp. Google Scholar’s Ranking -Algorithm: The Impact of Articles’ Age (An Empirical Study). In -Shahram Latifi, editor, Proceedings of the 6th International -Conference on Information Technology: New Generations -(ITNG’09), pages 160–164, Las Vegas (USA), April 2009. IEEE. -doi: 10.1109/ITNG.2009.317. ISBN 978-1424437702. Available -on http://www.sciplore.org. -[4] Google. Google’s Search Engine Optimization Starter Guide. -PDF, November 2008. URL http://www.google.com/- -webmasters/docs/search-engine-optimization-starter-guide.pdf. -[5] Albert Bifet and Carlos Castillo. An Analysis of Factors Used -in Search Engine Ranking. In Proceedings of the 14th -International World Wide Web Conference (WWW2005), First -International Workshop on Adversarial Information Retrieval on -the Web (AIRWeb’05), 2005. -http://airweb.cse.lehigh.edu/2005/bifet.pdf. -[6] Michael P. Evans. Analysing Google rankings through search -engine optimization data. Internet Research, 17 (1): 21–37, 2007. -doi: 10.1108/10662240710730470. -[7] Jin Zhang and Alexandra Dimitroff. The impact of metadata -implementation on webpage visibility in search engine results -(Part II). Cross-Language Information Retrieval, 41 (3): 691– -715, May 2005. -[8] Harold Davis.
Search Engine Optimization. O’Reilly, 2006. -[9] Jennifer Grappone and Gradiva Couzin. Search Engine -Optimization: An Hour a Day. John Wiley and Sons, 2nd edition, -2008. -[10] Peter Kent. Search engine optimization for dummies. Wiley -Publishing Inc, 2006. -[11] AA Benczur, K Csalogány, T Sarlós, and M Uher. -SpamRank – Fully Automatic Link Spam Detection. In -Adversarial Information Retrieval on the Web (AIRWeb’05), -2005. -[12] A. Benczúr, K. Csalogány, and T. Sarlós. Link-based -similarity search to fight web spam. Adversarial Information -Retrieval on the Web (AIRWEB), Seattle, Washington, USA, -2006. -[13] I. Drost and T. Scheffer. Thwarting the nigritude -ultramarine: Learning to identify link spam. Lecture Notes in -Computer Science, 3720: 96, 2005. -[14] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, -and statistics: Using statistical analysis to locate spam web -pages. pages 1–6, 2004. -[15] Q. Gan and T. Suel. Improving web spam classifiers using -link structure. In Proceedings of the 3rd international workshop -on Adversarial information retrieval on the web, page 20. ACM, -2007. -[16] Z. Gyöngyi and H. Garcia-Molina. Link spam alliances. In -Proceedings of the 31st international conference on Very large -data bases, page 528. VLDB Endowment, 2005. -[17] H. Saito, M. Toyoda, M. Kitsuregawa, and K. Aihara. A -large-scale study of link spam detection by graph algorithms. In -Proceedings of the 3rd international workshop on Adversarial -information retrieval on the web, page 48. ACM, 2007. -[18] B. Wu and K. Chellapilla. Extracting link spam using biased -random walks from spam seed sets. In Proceedings of the 3rd -international workshop on Adversarial information retrieval on -the web, page 44. ACM, 2007. -[19] C. Castillo, D. Donato, A. Gionis, V. Murdock, and -F. Silvestri. Know your neighbors: Web spam detection using the -web topology.
In Proceedings of the 30th annual international -ACM SIGIR conference on Research and development in -information retrieval, page 430. ACM, 2007. -[20] G.G. Geng, C.H. Wang, and Q.D. Li. Improving -Spamdexing Detection Via a Two-Stage Classification Strategy. -page 356, 2008. -[21] I.S. Nathenson. Internet infoglut and invisible ink: -Spamdexing search engines with meta tags. Harv. J. Law & Tec, -12: 43–683, 1998. -[22] T. Urvoy, E. Chauveau, P. Filoche, and T. Lavergne. -Tracking web spam with HTML style similarities. ACM -Transactions on the Web (TWEB), 2, 2008. -[23] T. Urvoy, T. Lavergne, and P. Filoche. Tracking web spam -with hidden style similarity. In AIRWeb 2006, page 25, 2006. -[24] Masahiro Kimura, Kazumi Saito, Kazuhiro Kazama, and -Shin ya Sato. Detecting Search Engine Spam from a Trackback -Network in Blogspace. Lecture Notes in Computer Science: -Knowledge-Based Intelligent Information and Engineering -Systems, 3684: 723–729, 2005. doi: 10.1007/11554028_101. -[25] Alexandros Ntoulas, Marc Najork, Mark Manasse, and -Dennis Fetterly. Detecting spam web pages through content -analysis. In 15th International Conference on World Wide Web, -pages 83–92. ACM, 2006. -[26] Baoning Wu and Brian D. Davison. Identifying link farm -spam pages. In 14th International Conference on World Wide -Web, pages 820–829, 2005. -[27] Yahoo! How do I improve the ranking of my web site in the -search results?, July 2007. URL http://help.yahoo.com/l/us/- -yahoo/search/indexing/ranking-02.html. -[28] Alex Chitu. Google’s Market Share in Your Country. -Website, March 2009. URL http://googlesystem.blogspot.com/- -2009/03/googles-market-share-in-your-country.html https://- -spreadsheets.google.com/- -ccc?key=pLaE9tsVLp_0y1 FKWBCKGBA. -[29] D. Lewandowski and P. Mayr. Exploring the academic -invisible web. Library Hi Tech, 24 (4): 529–539, 2006. -[30] Nisa Bakkalbasi, Kathleen Bauer, Janis Glover, and Lei -Wang.
Three options for citation tracking: Google Scholar, -Scopus and Web of Science. Biomedical Digital Libraries, 3, -2006. doi: 10.1186/1742-5581-3-7. -[31] John J. Meier and Thomas W. Conkling. Google Scholar’s -Coverage of the Engineering Literature: An Empirical Study. The -Journal of Academic Librarianship, 34 (34): 196–201, 2008. -[32] William H. Walters. Google Scholar coverage of a -multidisciplinary field. Information Processing & Management, -43 (4): 1121–1132, July 2007. doi: -10.1016/j.ipm.2006.08.006. -[33] Google. About Google Scholar. Website, 2008. URL http://- -scholar.google.com/intl/en/scholar/about.html. -[34] Bert van Heerde. RE: Pre-print: Academic Search Engine -Optimization. Email, 3 September 2009. -[35] Google Scholar. Support for Scholarly Publishers. Website, -2009. URL http://scholar.google.com/intl/en/scholar/- -publishers.html. -[36] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 -extension to multiple weighted fields. In Proceedings of the -thirteenth ACM international conference on Information and -knowledge management, pages 42–49. ACM New York, NY, -USA, 2004. diff --git a/bin/34_1273675500_P09-1038.body b/bin/34_1273675500_P09-1038.body deleted file mode 100644 index e56a91c..0000000 --- a/bin/34_1273675500_P09-1038.body +++ /dev/null @@ -1,771 +0,0 @@ -Phrase-Based Statistical Machine Translation as a Traveling Salesman -Problem -Mikhail Zaslavskiy* Marc Dymetman Nicola Cancedda - Mines ParisTech, Institut Curie Xerox Research Centre Europe - 77305 Fontainebleau, France 38240 Meylan, France - mikhail.zaslavskiy@ensmp.fr {marc.dymetman,nicola.cancedda}@xrce.xerox.com -Abstract -An efficient decoding algorithm is a cru- -cial element of any statistical machine -translation system.
Some researchers have -noted certain similarities between SMT -decoding and the famous Traveling Sales- -man Problem; in particular (Knight, 1999) -has shown that any TSP instance can be -mapped to a sub-case of a word-based -SMT model, demonstrating NP-hardness -of the decoding task. In this paper, we fo- -cus on the reverse mapping, showing that -any phrase-based SMT decoding problem -can be directly reformulated as a TSP. The -transformation is very natural, deepens our -understanding of the decoding problem, -and allows direct use of any of the pow- -erful existing TSP solvers for SMT de- -coding. We test our approach on three -datasets, and compare a TSP-based de- -coder to the popular beam-search algo- -rithm. In all cases, our method provides -competitive or better performance. -1 Introduction -Phrase-based systems (Koehn et al., 2003) are -probably the most widespread class of Statistical -Machine Translation systems, and arguably one of -the most successful. They use aligned sequences -of words, called biphrases, as building blocks for -translations, and score alternative candidate trans- -lations for the same source sentence based on a -log-linear model of the conditional probability of -target sentences given the source sentence: -$p(T, a \mid S) = \frac{1}{Z_S} \exp \sum_k \lambda_k h_k(S, a, T)$ (1) -where the $h_k$ are features, that is, functions of the -source string S, of the target string T, and of the -* This work was conducted during an internship at -XRCE. -alignment a, where the alignment is a representa- -tion of the sequence of biphrases that were used -in order to build T from S; the $\lambda_k$’s are weights -and $Z_S$ is a normalization factor that guarantees -that p is a proper conditional probability distri- -bution over the pairs (T, a). Some features are -local, i.e. decompose over biphrases and can be -precomputed and stored in advance.
These typ- -ically include forward and reverse phrase condi- -tional probability features $\log p(\tilde{t} \mid \tilde{s})$ as well as -$\log p(\tilde{s} \mid \tilde{t})$, where $\tilde{s}$ is the source side of the -biphrase and $\tilde{t}$ the target side, and the so-called -“phrase penalty” and “word penalty” features, -which count the number of phrases and words in -the alignment. Other features are non-local, i.e. -depend on the order in which biphrases appear in -the alignment. Typical non-local features include -one or more n-gram language models as well as -a distortion feature, measuring by how much the -order of biphrases in the candidate translation de- -viates from their order in the source sentence. -Given such a model, where the $\lambda_k$’s have been -tuned on a development set in order to minimize -some error rate (see e.g. (Lopez, 2008)), together -with a library of biphrases extracted from some -large training corpus, a decoder implements the -actual search among alternative translations: -$(a^*, T^*) = \arg\max_{(a,T)} p(T, a \mid S)$ (2) -The decoding problem (2) is a discrete optimiza- -tion problem. Usually, it is very hard to find the -exact optimum and, therefore, an approximate so- -lution is used. Currently, most decoders are based -on some variant of a heuristic left-to-right search, -that is, they attempt to build a candidate translation -(a, T) incrementally, from left to right, extending -the current partial translation at each step with a -new biphrase, and computing a score composed of -two contributions: one for the known elements of -the partial translation so far, and one a heuristic -Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 333–341, -Suntec, Singapore, 2-7 August 2009. ©2009 ACL and AFNLP -estimate of the remaining cost for completing the -translation.
The variant which is mostly used is -a form of beam-search, where several partial can- -didates are maintained in parallel, and candidates -for which the current score is too low are pruned -in favor of candidates that are more promising. -We will see in the next section that some char- -acteristics of beam-search make it a suboptimal -choice for phrase-based decoding, and we will -propose an alternative. This alternative is based on -the observation that phrase-based decoding can be -very naturally cast as a Traveling Salesman Prob- -lem (TSP), one of the best studied problems in -combinatorial optimization. We will show that this -formulation is not only a powerful conceptual de- -vice for reasoning on decoding, but is also prac- -tically convenient: in the same amount of time, -off-the-shelf TSP solvers can find higher scoring -solutions than the state-of-the art beam-search de- -coder implemented in Moses (Hoang and Koehn, -2008). -2 Related work -Beam-search decoding -In beam-search decoding, candidate translation -prefixes are iteratively extended with new phrases. -In its most widespread variant, stack decoding, -prefixes obtained by consuming the same number -of source words, no matter which, are grouped to- -gether in the same stack1 and compete against one -another. Threshold and histogram pruning are ap- -plied: the former consists in dropping all prefixes -having a score lesser than the best score by more -than some fixed amount (a parameter of the algo- -rithm), the latter consists in dropping all prefixes -below a certain rank. -While quite successful in practice, stack decod- -ing presents some shortcomings. A first one is that -prefixes obtained by translating different subsets -of source words compete against one another. 
In -one early formulation of stack decoding for SMT -(Germann et al., 2001), the authors indeed pro- -posed to lazily create one stack for each subset -of source words, but acknowledged issues with -the potential combinatorial explosion in the num- -ber of stacks. This problem is reduced by the use -of heuristics for estimating the cost of translating -the remaining part of the source sentence. How- -1While commonly adopted in the speech and SMT com- -munities, this is a bit of a misnomer, since the used data struc- -tures are priority queues, not stacks. -ever, this solution is only partially satisfactory. On -the one hand, heuristics should be computationally -light, much lighter than computing the actual best -score itself, while, on the other hand, the heuris- -tics should be tight, as otherwise pruning errors -will ensue. There is no clear criterion to guide -in this trade-off. Even when good heuristics are -available, the decoder will show a bias towards -putting at the beginning the translation of a certain -portion of the source, either because this portion -is less ambiguous (i.e. its translation has larger -conditional probability) or because the associated -heuristics is less tight, hence more optimistic. Fi- -nally, since the translation is built left-to-right the -decoder cannot optimize the search by taking ad- -vantage of highly unambiguous and informative -portions that should be best translated far from the -beginning. All these reasons motivate considering -alternative decoding strategies. -Word-based SMT and the TSP -As already mentioned, the similarity between -SMT decoding and TSP was recognized in -(Knight, 1999), who focussed on showing that -any TSP can be reformulated as a sub-class of the -SMT decoding problem, proving that SMT decod- -ing is NP-hard. Following this work, the exis- -tence of many efficient TSP algorithms then in- -spired certain adaptations of the underlying tech- -niques to SMT decoding for word-based models. 
-Thus, (Germann et al., 2001) adapt a TSP sub- -tour elimination strategy to an IBM-4 model, us- -ing generic Integer Programming techniques. The -paper comes close to a TSP formulation of de- -coding with IBM-4 models, but does not pursue -this route to the end, stating that “It is difficult -to convert decoding into straight TSP, but a wide -range of combinatorial optimization problems (in- -cluding TSP) can be expressed in the more gen- -eral framework of linear integer programming”. -By employing generic IP techniques, it is how- -ever impossible to rely on the variety of more -efficient both exact and approximate approaches -which have been designed specifically for the TSP. -In (Tillmann and Ney, 2003) and (Tillmann, 2006), -the authors modify a certain Dynamic Program- -ming technique used for TSP for use with an IBM- -4 word-based model and a phrase-based model re- -spectively. However, to our knowledge, none of -these works has proposed a direct reformulation -of these SMT models as TSP instances. We be- -lieve we are the first to do so, working in our case -with the mainstream phrase-based SMT models, -and therefore making it possible to directly apply -existing TSP solvers to SMT. -3 The Traveling Salesman Problem and -its variants -In this paper the Traveling Salesman Problem ap- -pears in four variants: -STSP. The most standard, and most studied, -variant is the Symmetric TSP: we are given a non- -directed graph G on N nodes, where the edges -carry real-valued costs. The STSP problem con- -sists in finding a tour of minimal total cost, where -a tour (also called Hamiltonian Circuit) is a “cir- -cular” sequence of nodes visiting each node of the -graph exactly once; -ATSP. The Asymmetric TSP, or ATSP, is a vari- -ant where the underlying graph G is directed and -where, for i and j two nodes of the graph, the -edges (i,j) and (j,i) may carry different costs. -SGTSP.
The Symmetric Generalized TSP, or -SGTSP: given a non-oriented graph G of $|G|$ -nodes with edges carrying real-valued costs, given -a partition of these $|G|$ nodes into m non-empty, -disjoint, subsets (called clusters), find a circular -sequence of m nodes of minimal total cost, where -each cluster is visited exactly once. -AGTSP. The Asymmetric Generalized TSP, or -AGTSP: similar to the SGTSP, but G is now a di- -rected graph. -The STSP is often simply denoted TSP in the -literature, and is known to be NP-hard (Applegate -et al., 2007); however there has been enormous -interest in developing efficient solvers for it, both -exact and approximate. -Most of existing algorithms are designed for -STSP, but ATSP, SGTSP and AGTSP may be re- -duced to STSP, and therefore solved by STSP al- -gorithms. -3.1 Reductions AGTSP→ATSP→STSP -The transformation of the AGTSP into the ATSP, -introduced by (Noon and Bean, 1993), is illus- -trated in Figure 1. In this diagram, we assume -that $Y_1, \ldots, Y_k$ are the nodes of a given cluster, -while X and Z are arbitrary nodes belonging to -other clusters. In the transformed graph, we in- -troduce edges between the $Y_i$’s in order to form a -cycle as shown in the figure, where each edge has -a large negative cost $-K$. We leave alone the in- -coming edge to $Y_i$ from X, but the outgoing edge -Figure 1: AGTSP→ATSP. -from $Y_i$ to X has its origin changed to $Y_{i-1}$. A -feasible tour in the original AGTSP problem pass- -ing through X, $Y_i$, Z will then be “encoded” as a -tour of the transformed graph that first traverses -X, then traverses $Y_i, \ldots, Y_k, \ldots, Y_{i-1}$, then tra- -verses Z (this encoding will have the same cost as -the original cost, minus $(k - 1)K$). Crucially, if -K is large enough, then the solver for the trans- -formed ATSP graph will tend to traverse as many -K edges as possible, meaning that it will traverse -exactly $k - 1$ such edges in the cluster, that is, it -will produce an encoding of some feasible tour of -the AGTSP problem.
-As for the transformation ATSP→STSP, several -variants are described in the literature, e.g. (Ap- -plegate et al., 2007, p. 126); the one we use is from -(Wikipedia, 2009) (not illustrated here for lack of -space). -3.2 TSP algorithms -TSP is one of the most studied problems in com- -binatorial optimization, and even a brief review of -existing approaches would take too much space. -Interested readers may consult (Applegate et al., -2007; Gutin, 2003) for good introductions. -One of the best existing TSP solvers is imple- -mented in the open source Concorde package (Ap- -plegate et al., 2005). Concorde includes the fastest -exact algorithm and one of the most efficient im- -plementations of the Lin-Kernighan (LK) heuris- -tic for finding an approximate solution. LK works -by generating an initial random feasible solution -for the TSP problem, and then repeatedly identi- -fying an ordered subset of k edges in the current -tour and an ordered subset of k edges not included -in the tour such that when they are swapped the -objective function is improved. This is somewhat -reminiscent of the Greedy decoding of (Germann -et al., 2001), but in LK several transformations can -be applied simultaneously, so that the risk of being -stuck in a local optimum is reduced (Applegate et -al., 2007, chapter 15). -As will be shown in the next section, phrase- -based SMT decoding can be directly reformulated -as an AGTSP. Here we use Concorde through -first transforming AGTSP into STSP, but it might -also be interesting in the future to use algorithms -specifically designed for AGTSP, which could im- -prove efficiency further (see Conclusion). -4 Phrase-based Decoding as TSP -In this section we reformulate the SMT decoding -problem as an AGTSP. We will illustrate the ap- -proach through a simple example: translating the -French sentence “cette traduction automatique est -curieuse ” into English.
We assume that the rele- -vant biphrases for translating the sentence are as -follows: -ID -source -target -h -cette -this -t -traduction -translation -ht -cette traduction -this translation -mt -traduction automatique -machine translation -a -automatique -automatic -m -automatique -machine -i -est -is -s -curieuse -strange -c -curieuse -curious -Under this model, we can produce, among others, -the following translations: -h mt i s this machine translation is strange -h c t i a this curious translation is automatic -ht s i a this translation strange is automatic -where we have indicated on the left the ordered se- -quence of biphrases that leads to each translation. -We now formulate decoding as an AGTSP, in -the following way. The graph nodes are all the -possible pairs (w, b), where w is a source word in -the source sentence s and b is a biphrase contain- -ing this source word. The graph clusters are the -subsets of the graph nodes that share a common -source word w. -The costs of a transition between nodes M and -N of the graph are defined as follows: -(a) If M is of the form (w, b) and N of the form -(w', b), in which b is a single biphrase, and w and -w' are consecutive words in b, then the transition -cost is 0: once we commit to using the first word -of b, there is no additional cost for traversing the -other source words covered by b. 
-(b) If M = (w, b), where w is the rightmost -source word in the biphrase b, and N = (w', b'), -where w' ≠ w is the leftmost source word in b', -then the transition cost corresponds to the cost -of selecting b' just after b; this will correspond -to “consuming” the source side of b' after having -consumed the source side of b (whatever their rel- -ative positions in the source sentence), and to pro- -ducing the target side of b' directly after the target -side of b; the transition cost is then the addition of -several contributions (weighted by their respective -λ (not shown), as in equation 1): -• The cost associated with the features local to -b in the biphrase library; -• The “distortion” cost of consuming the -source word w' just after the source word w: -|pos(w') − pos(w) − 1|, where pos(w) and -pos(w') are the positions of w and w' in the -source sentence. -• The language model cost of producing the -target words of b' right after the target words -of b; with a bigram language model, this cost -can be precomputed directly from b and b'. -This restriction to bigram models will be re- -moved in Section 4.1. -(c) In all other cases, the transition cost is infinite, -or, in other words, there is no edge in the graph -between M and N. -A special cluster containing a single node (de- -noted by $-$$ in the figures), and corresponding to -special beginning-of-sentence symbols must also -be included: the corresponding edges and weights -can be worked out easily. Figures 2 and 3 give -some illustrations of what we have just described. -4.1 From Bigram to N-gram LM -Successful phrase-based systems typically employ -language models of order higher than two. How- -ever, our models so far have the following impor- -tant “Markovian” property: the cost of a path is -additive relative to the costs of transitions.
For -example, in the example of Figure 3, the cost of -this • machine translation • is • strange, can only -take into account the conditional probability of the -word strange relative to the word is, but not rela- -tive to the words translation and is. If we want to -extend the power of the model to general n-gram -language models, and in particular to the 3-gram -case (on which we concentrate here, but the tech- -niques can be easily extended to the general case), -the following approach can be applied. -Figure 2: Transition graph for the source sentence -cette traduction automatique est curieuse. Only -edges entering or exiting the node traduction – mt -are shown. The only successor to [traduction – -mt] is [automatique – mt], and [cette – ht] is not a -predecessor of [traduction – mt]. -Figure 3: A GTSP tour is illustrated, correspond- -ing to the displayed output. -Compiling Out for Trigram models -This approach consists in “compiling out” all -biphrases with a target side of only one word. -We replace each biphrase b with single-word tar- -get side by “extended” biphrases b_1, ... , b_r, which -are “concatenations” of b and some other biphrase -b' in the library [2]. To give an example, suppose -that we: (1) remove from the biphrase library the -biphrase i, which has a single word target, and (2) -add to the library the extended biphrases mti, ti, -si, ..., that is, all the extended biphrases consist- -ing of the concatenation of a biphrase in the library -with i. It is then clear that these extended biphrases -will provide enough context to compute a trigram -probability for the target word produced immedi- -ately next (in the examples, for the words strange, -automatic and automatic respectively). If we do -that exhaustively for all biphrases (relevant for the -source sentence at hand) that, like i, have a single- -word target, we will obtain a representation that -allows a trigram language model to be computed -at each point. -[2] In the figures, such “concatenations” are denoted by -[b' • b]; they are interpreted as encapsulations of first con- -suming the source side of b', whether or not this source side -precedes the source side of b in the source sentence, produc- -ing the target side of b', consuming the source side of b, and -producing the target side of b immediately after that of b'. -Figure 4: Compiling-out of biphrase i: (est,is). -The situation becomes clearer by looking at Fig- -ure 4, where we have only eliminated the biphrase -i, and only shown some of the extended biphrases -that now encapsulate i, and where we show one -valid circuit. Note that we are now able to as- -sociate with the edge connecting the two nodes -(est, mti) and (curieuse, s) a trigram cost because -mti provides a large enough target context.
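A small Python sketch of compiling out a single biphrase, in the spirit of the example with i = (est, is). The `Biphrase` type, the `compile_out` function, the encoding of source words as positions, and the rule that b' must not share source positions with b are assumptions made for illustration; the paper does not specify an implementation.

```python
from typing import List, NamedTuple, Tuple


class Biphrase(NamedTuple):
    name: str
    source: Tuple[int, ...]   # source positions covered (hypothetical encoding)
    target: Tuple[str, ...]   # target words produced


def compile_out(library: List[Biphrase], name: str) -> List[Biphrase]:
    """Replace the single-word-target biphrase `name` by extended biphrases.

    Each extended biphrase [b' . b] first consumes and produces b', then b.
    Assumption: b' must not cover any source position already covered by b.
    """
    b = next(x for x in library if x.name == name)
    if len(b.target) != 1:
        raise ValueError("only biphrases with a one-word target are compiled out")
    extended = [
        Biphrase(other.name + b.name, other.source + b.source, other.target + b.target)
        for other in library
        if other.name != name and not set(other.source) & set(b.source)
    ]
    return [x for x in library if x.name != name] + extended


# Source positions: cette=0, traduction=1, automatique=2, est=3, curieuse=4.
EXAMPLE_LIBRARY = [
    Biphrase("h", (0,), ("this",)),
    Biphrase("t", (1,), ("translation",)),
    Biphrase("ht", (0, 1), ("this", "translation")),
    Biphrase("mt", (1, 2), ("machine", "translation")),
    Biphrase("a", (2,), ("automatic",)),
    Biphrase("m", (2,), ("machine",)),
    Biphrase("i", (3,), ("is",)),
    Biphrase("s", (4,), ("strange",)),
    Biphrase("c", (4,), ("curious",)),
]
```

Calling `compile_out(EXAMPLE_LIBRARY, "i")` removes i and adds extended biphrases such as mti, ti and si, each of which carries at least two target words of context for the word produced next.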
-While this exhaustive “compiling out” method -works in principle, it has a serious defect: if for -the sentence to be translated, there are m relevant -biphrases, among which k have single-word tar- -gets, then we will create on the order of k·m ex- -tended biphrases, which may represent a signif- -icant overhead for the TSP solver, as soon as k -is large relative to m, which is typically the case. -The problem becomes even worse if we extend the -compiling-out method to n-gram language models -with n > 3. In the Future Work section below, -we describe a powerful approach for circumvent- -ing this problem, but with which we have not ex- -perimented yet. -5 Experiments -5.1 Monolingual word re-ordering -[Figure 5 plots not reproduced: panels (a) (b) (c) (d); x axis: Time (sec); curves: BEAM-SEARCH and TSP] -Figure 5: (a), (b): LM and BLEU scores as functions of time for a bigram LM; (c), (d): the same for -a trigram LM. The x axis corresponds to the cumulative time for processing the test set; for (a) and (c), -the y axis corresponds to the mean difference (over all sentences) between the lm score of the output -and the lm score of the reference normalized by the sentence length N: (LM(ref)-LM(true))/N. The solid -line with star marks corresponds to using beam-search with different pruning thresholds, which result in -different processing times and performances. The cross corresponds to using the exact-TSP decoder (in -this case the time to the optimal solution is not under the user’s control). -In the first series of experiments we consider the -artificial task of reconstructing the original word -order of a given English sentence. First, we ran- -domly permute words in the sentence, and then -we try to reconstruct the original order by max- -imizing the LM score over all possible permuta- -tions. The reconstruction procedure may be seen -as a translation problem from “Bad English” to -“Good English”.
Usually the LM score is used -as one component of a more complex decoder -score which also includes biphrase and distortion -scores. But in this particular “translation task” -from bad to good English, we consider that all -“biphrases” are of the form e — e, where e is an -English word, and we do not take into account -any distortion: we only consider the quality of -the permutation as it is measured by the LM com- -ponent. Since for each “source word” e, there is -exactly one possible “biphrase” e — e each clus- -ter of the Generalized TSP representation of the -decoding problem contains exactly one node; in -other terms, the Generalized TSP in this situation -is simply a standard TSP. Since the decoding phase -is then equivalent to a word reordering, the LM -score may be used to compare the performance -of different decoding algorithms. Here, we com- -pare three different algorithms: classical beam- -search (Moses); a decoder based on an exact TSP -solver (Concorde); a decoder based on an approx- -imate TSP solver (Lin-Kernighan as implemented -in the Concorde solver) 3. In the Beam-search -and the LK-based TSP solver we can control the -trade-off between approximation quality and run- -ning time. To measure re-ordering quality, we use -two scores. The first one is just the “internal” LM -score; since all three algorithms attempt to maxi- -mize this score, a natural evaluation procedure is -to plot its value versus the elapsed time. The sec- -3 Both TSP decoders may be used with/or without a distor- -tion limit; in our experiments we do not use this parameter. -ond score is BLEU (Papineni et al., 2001), com- -puted between the reconstructed and the original -sentences, which allows us to check how well the -quality of reconstruction correlates with the inter- -nal score. 
The training dataset for learning the LM -consists of 50000 sentences from the NewsCommen- -tary corpus (Callison-Burch et al., 2008); the test -dataset for word reordering consists of 170 sen- -tences, with an average length of -17 words. -Bigram based reordering. First we consider -a bigram Language Model and the algorithms try -to find the re-ordering that maximizes the LM -score. The TSP solver used here is exact, that is, -it actually finds the optimal tour. Figures 5(a,b) -present the performance of the TSP and Beam- -search based methods. -Trigram based reordering. Then we consider -a trigram based Language Model and the algo- -rithms again try to maximize the LM score. The -trigram model used is a variant of the exhaustive -compiling-out procedure described in Section 4.1. -Again, we use an exact TSP solver. -Looking at Figure 5a, we see a somewhat sur- -prising fact: the cross and some star points have -positive y coordinates! This means that, when us- -ing a bigram language model, it is often possible -to reorder the words of a randomly permuted ref- -erence sentence in such a way that the LM score -of the reordered sentence is larger than the LM score of -the reference. A second notable point is that the -increase in the LM-score of the beam-search with -time is steady but very slow, and never reaches the -level of performance obtained with the exact-TSP -procedure, even when increasing the time by sev- -eral orders of magnitude. Also to be noted is that -the solution obtained by the exact-TSP is provably -the optimum, which is almost never the case of -the beam-search procedure. In Figure 5b, we re- -port the BLEU score of the reordered sentences -in the test set relative to the original reference -sentences. Here we see that the exact-TSP out- -puts are closer to the references in terms of BLEU -than the beam-search solutions.
Although the TSP -output does not recover the reference sentences -(it produces sentences with a slightly higher LM -score than the references), it does reconstruct the -references better than the beam-search. The ex- -periments with trigram language models (Figures -5(c,d)) show similar trends to those with bigrams. -5.2 Translation experiments with a bigram -language model -In this section we consider two real translation -tasks, namely, translation from English to French, -trained on Europarl (Koehn et al., 2003) and trans- -lation from German to Spanish training on the -NewsCommentary corpus. For Europarl, the train- -ing set includes 2.81 million sentences, and the -test set 500. For NewsCommentary the training -set is smaller: around 63k sentences, with a test -set of 500 sentences. Figure 6 presents Decoder -and Bleu scores as functions of time for the two -corpuses. -Since in the real translation task, the size of the -TSP graph is much larger than in the artificial re- -ordering task (in our experiments the median size -of the TSP graph was around 400 nodes, some- -times growing up to 2000 nodes), directly apply- -ing the exact TSP solver would take too long; in- -stead we use the approximate LK algorithm and -compare it to Beam-Search. The efficiency of the -LK algorithm can be significantly increased by us- -ing a good initialization. To compare the quality of -the LK and Beam-Search methods we take a rough -initial solution produced by the Beam-Search al- -gorithm using a small value for the stack size and -then use it as initial point, both for the LK algo- -rithm and for further Beam-Search optimization -(where as before we vary the Beam-Search thresh- -olds in order to trade quality for time). -In the case of the Europarl corpus, we observe -that LK outperforms Beam-Search in terms of the -Decoder score as well as in terms of the BLEU -score. 
Note that the difference between the two al- -gorithms increases steeply at the beginning, which -means that we can significantly increase the qual- -ity of the Beam-Search solution by using the LK -algorithm at a very small price. In addition, it is -important to note that the BLEU scores obtained in -these experiments correspond to feature weights, -in the log-linear model (1), that have been opti- -mized for the Moses decoder, but not for the TSP -decoder: optimizing these parameters relative to -the TSP decoder could improve its BLEU scores -still further. -On the News corpus, again, LK outperforms -Beam-Search in terms of the Decoder score. The -situation with the BLEU score is less clear. -Neither algorithm shows any clear score im- -provement with increasing running time, which -suggests that the decoder’s objective function is -not very well correlated with the BLEU score on -this corpus. -[Figure 6 plots not reproduced: panels (a) (b) (c) (d); x axis: Time (sec); curves: BEAM-SEARCH and TSP (LK)] -Figure 6: (a), (b): Europarl corpus, translation from English to French; (c),(d): NewsCommentary cor- -pus, translation from German to Spanish. Average value of the decoder and the BLEU scores (over 500 -test sentences) as a function of time. The trade-off quality/time in the case of LK is controlled by the -number of iterations, and each point corresponds to a particular number of iterations; in our experiments -LK was run with a number of iterations varying between 2k and 170k. The same trade-off in the case of -Beam-Search is controlled by varying the beam thresholds. -6 Future Work -In section 4.1, we described a general “compiling -out” method for extending our TSP representation -to handling trigram and N-gram language models, -but we noted that the method may lead to combi- -natorial explosion of the TSP graph. While this -problem was manageable for the artificial mono- -lingual word re-ordering (which had only one pos- -sible translation for each source word), it be- -comes unwieldy for the real translation experi- -ments, which is why in this paper we only consid- -ered bigram LMs for these experiments. However, -we know how to handle this problem in principle, -and we now describe a method that we plan to ex- -periment with in the future. -To avoid the large number of artificial biphrases -as in 4.1, we perform an adaptive selection. Let us -suppose that (w, b) is a SMT decoding graph node, -where b is a biphrase containing only one word on -the target side. On the first step, when we evaluate -the traveling cost from (w, b) to (w', b'), we take -the language model component equal to -$\min_{b'' \neq b',\, b} -\log p(b'.v \mid b.e,\ b''.e)$, -where b'.v represents the first word of the b' tar- -get side, b.e is the only word of the b target -side, and b''.e is the last word of the b'' tar- -get side. This procedure underestimates the total -cost of a tour passing through biphrases that have a -single-word target. Therefore if the optimal tour -passes only through biphrases with more than one -word on their target side, then we are sure that -this tour is also optimal in terms of the tri-gram -language model. Otherwise, if the optimal tour -passes through (w, b), where b is a biphrase hav- -ing a single-word target, we add only the extended -biphrases related to b as we described in section -4.1, and then we recompute the optimal tour. Iter- -ating this procedure provably converges to an op- -timal solution. -This powerful method, which was proposed in -(Kam and Kopec, 1996; Popat et al., 2001) in the -context of a finite-state model (but not of TSP), -can be easily extended to N-gram situations, and -typically converges in a small number of itera- -tions.
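The optimistic first-step cost described in this section (the smallest negative log trigram probability over all biphrases b'' that could precede b) can be sketched as follows. The function name and the `trigram_prob(previous, current, next)` callback are hypothetical stand-ins for whatever language model interface is available; this is an illustration under those assumptions, not the authors' implementation.

```python
import math


def optimistic_lm_cost(next_word, current_word, possible_previous_words, trigram_prob):
    """Lower bound on the trigram cost of producing `next_word` after `current_word`.

    next_word               -- b'.v, first target word of the next biphrase
    current_word            -- b.e, the only target word of the current biphrase
    possible_previous_words -- last target words b''.e of candidate predecessors
    trigram_prob            -- hypothetical callable returning
                               p(next | previous, current) as a float in (0, 1]
    """
    # Taking the minimum over all possible contexts never overestimates the
    # true trigram cost, whichever predecessor the final tour actually uses.
    return min(
        -math.log(trigram_prob(previous, current_word, next_word))
        for previous in possible_previous_words
    )
```

Because the bound never exceeds the true cost, a tour that is optimal under it and avoids single-word-target biphrases is also optimal under the trigram model; otherwise only the offending biphrases need to be compiled out before solving again. An empty list of candidate predecessors raises `ValueError` in this sketch.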
-7 Conclusion -The main contribution of this paper has been to -propose a transformation for an arbitrary phrase- -based SMT decoding instance into a TSP instance. -While certain similarities of SMT decoding and -TSP were already pointed out in (Knight, 1999), -where it was shown that any Traveling Salesman -Problem may be reformulated as an instance of -a (simplistic) SMT decoding task, and while cer- -tain techniques used for TSP were then adapted to -word-based SMT decoding (Germann et al., 2001; -Tillmann and Ney, 2003; Tillmann, 2006), we are -not aware of any previous work that shows that -SMT decoding can be directly reformulated as a -TSP. Beside the general interest of this transfor- -mation for understanding decoding, it also opens -the door to direct application of the variety of ex- -isting TSP algorithms to SMT. Our experiments -on synthetic and real data show that fast TSP al- -gorithms can handle selection and reordering in -SMT comparably or better than the state-of-the- -art beam-search strategy, converging on solutions -with higher objective function in a shorter time. -The proposed method proceeds by first con- -structing an AGTSP instance from the decoding -problem, and then converting this instance first -into ATSP and finally into STSP. At this point, a -direct application of the well known STSP solver -Concorde (with Lin-Kernighan heuristic) already -gives good results. We believe however that there -might exist even more efficient alternatives. In- -stead of converting the AGTSP instance into a -STSP instance, it might prove better to use di- -rectly algorithms expressly designed for ATSP -or AGTSP. For instance, some of the algorithms -tested in the context of the DIMACS implemen- -tation challenge for ATSP (Johnson et al., 2002) -might well prove superior. There is also active re- -search around AGTSP algorithms. Recently new -effective methods based on a “memetic” strategy -(Buriol et al., 2004; Gutin et al., 2008) have been -put forward. 
These methods combined with our -proposed formulation provide ready-to-use SMT -decoders, which it will be interesting to compare. -Acknowledgments -Thanks to Vassilina Nikoulina for her advice about -running Moses on the test datasets. -340 -References diff --git a/bin/34_1273675500_P09-1038.cite b/bin/34_1273675500_P09-1038.cite deleted file mode 100644 index 4d3ba62..0000000 --- a/bin/34_1273675500_P09-1038.cite +++ /dev/null @@ -1,72 +0,0 @@ -David L. Applegate, Robert E. Bixby, Vasek Chvatal, -and William J. Cook. 2005. Concorde -tsp solver. http://www.tsp.gatech.edu/ -concorde.html. -David L. Applegate, Robert E. Bixby, Vasek Chvatal, -and William J. Cook. 2007. The Traveling Sales- -man Problem: A Computational Study (Princeton -Series in Applied Mathematics). Princeton Univer- -sity Press, January. -Luciana Buriol, Paulo M. Franc¸a, and Pablo Moscato. -2004. A new memetic algorithm for the asymmetric -traveling salesman problem. Journal of Heuristics, -10(5):483–506. -Chris Callison-Burch, Philipp Koehn, Christof Monz, -Josh Schroeder, and Cameron Shaw Fordyce, edi- -tors. 2008. Proceedings of the Third Workshop on -SMT. ACL, Columbus, Ohio, June. -Ulrich Germann, Michael Jahr, Kevin Knight, and -Daniel Marcu. 2001. Fast decoding and optimal -decoding for machine translation. In In Proceedings -ofACL 39, pages 228–235. -Gregory Gutin, Daniel Karapetyan, and Krasnogor Na- -talio. 2008. Memetic algorithm for the generalized -asymmetric traveling salesman problem. In NICSO -2007, pages 199–210. Springer Berlin. -G. Gutin. 2003. Travelling salesman and related prob- -lems. In Handbook of Graph Theory. -Hieu Hoang and Philipp Koehn. 2008. Design of the -Moses decoder for statistical machine translation. In -ACL 2008 Software workshop, pages 58–65, Colum- -bus, Ohio, June. ACL. -D.S. Johnson, G. Gutin, L.A. McGeoch, A. Yeo, -W. Zhang, and A. Zverovich. 2002. Experimen- -tal analysis of heuristics for the atsp. 
In The Trav- -elling Salesman Problem and Its Variations, pages -445–487. -Anthony C. Kam and Gary E. Kopec. 1996. Document -image decoding by heuristic search. IEEE Transac- -tions on Pattern Analysis and Machine Intelligence, -18:945–950. -Kevin Knight. 1999. Decoding complexity in word- -replacement translation models. Computational -Linguistics, 25:607–615. -Philipp Koehn, Franz Josef Och, and Daniel Marcu. -2003. Statistical phrase-based translation. In -NAACL 2003, pages 48–54, Morristown, NJ, USA. -Association for Computational Linguistics. -Adam Lopez. 2008. Statistical machine translation. -ACM Comput. Surv., 40(3):1–49. -C. Noon and J.C. Bean. 1993. An efficient transforma- -tion of the generalized traveling salesman problem. -INFOR, pages 39–44. -Kishore Papineni, Salim Roukos, Todd Ward, and -Wei J. Zhu. 2001. BLEU: a Method for Automatic -Evaluation of Machine Translation. IBM Research -Report, RC22176. -Kris Popat, Daniel H. Greene, Justin K. Romberg, and -Dan S. Bloomberg. 2001. Adding linguistic con- -straints to document image decoding: Comparing -the iterated complete path and stack algorithms. -Christoph Tillmann and Hermann Ney. 2003. Word re- -ordering and a dynamic programming beam search -algorithm for statistical machine translation. Com- -put. Linguist., 29(1):97–133. -Christoph Tillmann. 2006. Efficient Dynamic Pro- -gramming Search Algorithms For Phrase-Based -SMT. In Workshop On Computationally Hard Prob- -lems And Joint Inference In Speech And Language -Processing. -Wikipedia. 2009. Travelling Salesman Problem — -Wikipedia, The Free Encyclopedia. [Online; ac- -cessed 5-May-2009]. 
\ No newline at end of file diff --git a/bin/34_1273675500_P09-1038.out b/bin/34_1273675500_P09-1038.out deleted file mode 100644 index 63c54b3..0000000 --- a/bin/34_1273675500_P09-1038.out +++ /dev/null @@ -1,312 +0,0 @@ - - - - -Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem -Mikhail Zaslavskiy Marc Dymetman Nicola Cancedda -Mines ParisTech, Institut Curie Xerox Research Centre Europe -
77305 Fontainebleau, France 38240 Meylan, France
-mikhail.zaslavskiy@ensmp.fr{marc.dymetman,nicola.cancedda}@xrce.xerox.com -An efficient decoding algorithm is a crucial element of any statistical machine translation system. Some researchers have noted certain similarities between SMT decoding and the famous Traveling Salesman Problem; in particular (Knight, 1999) has shown that any TSP instance can be mapped to a sub-case of a word-based SMT model, demonstrating NP-hardness of the decoding task. In this paper, we focus on the reverse mapping, showing that any phrase-based SMT decoding problem can be directly reformulated as a TSP. The transformation is very natural, deepens our understanding of the decoding problem, and allows direct use of any of the powerful existing TSP solvers for SMT decoding. We test our approach on three datasets, and compare a TSP-based decoder to the popular beam-search algorithm. In all cases, our method provides competitive or better performance -
-
- - - - -David L Applegate -Robert E Bixby -Vasek Chvatal -William J Cook - -Concorde tsp solver -2005 -http://www.tsp.gatech.edu/ concorde.html - -e too much place. Interested readers may consult (Applegate et al., 2007; Gutin, 2003) for good introductions. One of the best existing TSP solvers is implemented in the open source Concorde package (Applegate et al., 2005). Concorde includes the fastest exact algorithm and one of the most efficient implementations of the Lin-Kernighan (LK) heuristic for finding an approximate solution. LK works by generating an initial - -Applegate, Bixby, Chvatal, Cook, 2005 -David L. Applegate, Robert E. Bixby, Vasek Chvatal, and William J. Cook. 2005. Concorde tsp solver. http://www.tsp.gatech.edu/ concorde.html. - - - -David L Applegate -Robert E Bixby -Vasek Chvatal -William J Cook - -The Traveling Salesman Problem: A Computational Study (Princeton Series in Applied Mathematics) -2007 -Princeton University Press - -exactly once. AGTSP. The Asymmetric Generalized TSP, or AGTSP: similar to the SGTSP, but G is now a directed graph. The STSP is often simply denoted TSP in the literature, and is known to be NP-hard (Applegate et al., 2007); however there has been enormous interest in developing efficient solvers for it, both exact and approximate. Most of existing algorithms are designed for STSP, but ATSP, SGTSP and AGTSP may be reduc -ch edges in the cluster, that is, it will produce an encoding of some feasible tour of the AGTSP problem. As for the transformation ATSP—*STSP, several variants are described in the literature, e.g. (Applegate et al., 2007, p. 126); the one we use is from (Wikipedia, 2009) (not illustrated here for lack of space). 
3.2 TSP algorithms TSP is one of the most studied problems in combinatorial optimization, and even a brief -ewhat 335 reminiscent of the Greedy decoding of (Germann et al., 2001), but in LK several transformations can be applied simultaneously, so that the risk of being stuck in a local optimum is reduced (Applegate et al., 2007, chapter 15). As will be shown in the next section, phrase- based SMT decoding can be directly reformulated as an AGTSP. Here we use Concorde through first transforming AGTSP into STSP, but it might - -Applegate, Bixby, Chvatal, Cook, 2007 -David L. Applegate, Robert E. Bixby, Vasek Chvatal, and William J. Cook. 2007. The Traveling Salesman Problem: A Computational Study (Princeton Series in Applied Mathematics). Princeton University Press, January. - - - -Luciana Buriol -Paulo M Franc¸a -Pablo Moscato - -A new memetic algorithm for the asymmetric traveling salesman problem -2004 -Journal of Heuristics -10 -Buriol, Franc¸a, Moscato, 2004 -Luciana Buriol, Paulo M. Franc¸a, and Pablo Moscato. 2004. A new memetic algorithm for the asymmetric traveling salesman problem. Journal of Heuristics, 10(5):483–506. - - -2008 -Proceedings of the Third Workshop on SMT. ACL -Chris Callison-Burch, Philipp Koehn, Christof Monz, Josh Schroeder, and Cameron Shaw Fordyce, editors -Columbus, Ohio -2008 -Chris Callison-Burch, Philipp Koehn, Christof Monz, Josh Schroeder, and Cameron Shaw Fordyce, editors. 2008. Proceedings of the Third Workshop on SMT. ACL, Columbus, Ohio, June. - - - -Ulrich Germann -Michael Jahr -Kevin Knight -Daniel Marcu - -Fast decoding and optimal decoding for machine translation. In -2001 -In Proceedings ofACL 39 -228--235 - -ing presents some shortcomings. A first one is that prefixes obtained by translating different subsets of source words compete against one another. 
In one early formulation of stack decoding for SMT (Germann et al., 2001), the authors indeed proposed to lazily create one stack for each subset of source words, but acknowledged issues with the potential combinatorial explosion in the number of stacks. This problem is re -T decoding is NP-hard. Following this work, the existence of many efficient TSP algorithms then inspired certain adaptations of the underlying techniques to SMT decoding for word-based models. Thus, (Germann et al., 2001) adapt a TSP sub- tour elimination strategy to an IBM-4 model, using generic Integer Programming techniques. The paper comes close to a TSP formulation of decoding with IBM-4 models, but does not purs - current tour and an ordered subset of k edges not included in the tour such that when they are swapped the objective function is improved. This is somewhat 335 reminiscent of the Greedy decoding of (Germann et al., 2001), but in LK several transformations can be applied simultaneously, so that the risk of being stuck in a local optimum is reduced (Applegate et al., 2007, chapter 15). As will be shown in the next sect -own that any Traveling Salesman Problem may be reformulated as an instance of a (simplistic) SMT decoding task, and while certain techniques used for TSP were then adapted to word-based SMT decoding (Germann et al., 2001; Tillmann and Ney, 2003; Tillmann, 2006), we are not aware of any previous work that shows that SMT decoding can be directly reformulated as a TSP. Beside the general interest of this transformation - -Germann, Jahr, Knight, Marcu, 2001 -Ulrich Germann, Michael Jahr, Kevin Knight, and Daniel Marcu. 2001. Fast decoding and optimal decoding for machine translation. In In Proceedings ofACL 39, pages 228–235. 
- - - -Gregory Gutin -Daniel Karapetyan -Krasnogor Natalio - -Memetic algorithm for the generalized asymmetric traveling salesman problem -2008 -In NICSO 2007 -199--210 -Springer -Berlin - -ge for ATSP (Johnson et al., 2002) might well prove superior. There is also active research around AGTSP algorithms. Recently new effective methods based on a “memetic” strategy (Buriol et al., 2004; Gutin et al., 2008) have been put forward. These methods combined with our proposed formulation provide ready-to-use SMT decoders, which it will be interesting to compare. Acknowledgments Thanks to Vassilina Nikoulina f - -Gutin, Karapetyan, Natalio, 2008 -Gregory Gutin, Daniel Karapetyan, and Krasnogor Natalio. 2008. Memetic algorithm for the generalized asymmetric traveling salesman problem. In NICSO 2007, pages 199–210. Springer Berlin. - - - -G Gutin - -Travelling salesman and related problems -2003 -In Handbook of Graph Theory - -SP is one of the most studied problems in combinatorial optimization, and even a brief review of existing approaches would take too much place. Interested readers may consult (Applegate et al., 2007; Gutin, 2003) for good introductions. One of the best existing TSP solvers is implemented in the open source Concorde package (Applegate et al., 2005). Concorde includes the fastest exact algorithm and one of the - -Gutin, 2003 -G. Gutin. 2003. Travelling salesman and related problems. In Handbook of Graph Theory. - - - -Hieu Hoang -Philipp Koehn - -Design of the Moses decoder for statistical machine translation -2008 -In ACL 2008 Software workshop -58--65 -ACL -Columbus, Ohio - -coding, but is also practically convenient: in the same amount of time, off-the-shelf TSP solvers can find higher scoring solutions than the state-of-the art beam-search decoder implemented in Moses (Hoang and Koehn, 2008). 2 Related work Beam-search decoding In beam-search decoding, candidate translation prefixes are iteratively extended with new phrases. 
In its most widespread variant, stack decoding, prefixes obtain - -Hoang, Koehn, 2008 -Hieu Hoang and Philipp Koehn. 2008. Design of the Moses decoder for statistical machine translation. In ACL 2008 Software workshop, pages 58–65, Columbus, Ohio, June. ACL. - - - -D S Johnson -G Gutin -L A McGeoch -A Yeo -W Zhang -A Zverovich - -Experimental analysis of heuristics for the atsp. In The Travelling Salesman Problem and Its Variations -2002 -445--487 - -nce, it might prove better to use directly algorithms expressly designed for ATSP or AGTSP. For instance, some of the algorithms tested in the context of the DIMACS implementation challenge for ATSP (Johnson et al., 2002) might well prove superior. There is also active research around AGTSP algorithms. Recently new effective methods based on a “memetic” strategy (Buriol et al., 2004; Gutin et al., 2008) have been put - -Johnson, Gutin, McGeoch, Yeo, Zhang, Zverovich, 2002 -D.S. Johnson, G. Gutin, L.A. McGeoch, A. Yeo, W. Zhang, and A. Zverovich. 2002. Experimental analysis of heuristics for the atsp. In The Travelling Salesman Problem and Its Variations, pages 445–487. - - - -Anthony C Kam -Gary E Kopec - -Document image decoding by heuristic search -1996 -IEEE Transactions on Pattern Analysis and Machine Intelligence -18--945 - - related to b as we described in section 4. 1, and then we recompute the optimal tour. Iterating this procedure provably converges to an optimal solution. This powerful method, which was proposed in (Kam and Kopec, 1996; Popat et al., 2001) in the context of a finite-state model (but not of TSP), can be easily extended to N-gram situations, and typically converges in a small number of iterations. 7 Conclusion The ma - -Kam, Kopec, 1996 -Anthony C. Kam and Gary E. Kopec. 1996. Document image decoding by heuristic search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18:945–950. 
- - - -Kevin Knight - -Decoding complexity in wordreplacement translation models -1999 -Computational Linguistics -25--607 - -thm is a crucial element of any statistical machine translation system. Some researchers have noted certain similarities between SMT decoding and the famous Traveling Salesman Problem; in particular (Knight, 1999) has shown that any TSP instance can be mapped to a sub-case of a word-based SMT model, demonstrating NP-hardness of the decoding task. In this paper, we focus on the reverse mapping, showing that any - the beginning. All these reasons motivate considering alternative decoding strategies. Word-based SMT and the TSP As already mentioned, the similarity between SMT decoding and TSP was recognized in (Knight, 1999), who focussed on showing that any TSP can be reformulated as a sub-class of the SMT decoding problem, proving that SMT decoding is NP-hard. Following this work, the existence of many efficient TSP al -is paper has been to propose a transformation for an arbitrary phrase- based SMT decoding instance into a TSP instance. While certain similarities of SMT decoding and TSP were already pointed out in (Knight, 1999), where it was shown that any Traveling Salesman Problem may be reformulated as an instance of a (simplistic) SMT decoding task, and while certain techniques used for TSP were then adapted to word-bas - -Knight, 1999 -Kevin Knight. 1999. Decoding complexity in wordreplacement translation models. Computational Linguistics, 25:607–615. - - - -Philipp Koehn -Franz Josef Och -Daniel Marcu - -Statistical phrase-based translation -2003 -In NAACL 2003 -48--54 -Association -for Computational Linguistics -Morristown, NJ, USA - -oach on three datasets, and compare a TSP-based decoder to the popular beam-search algorithm. In all cases, our method provides competitive or better performance. 
1 Introduction Phrase-based systems (Koehn et al., 2003) are probably the most widespread class of Statistical Machine Translation systems, and arguably one of the most successful. They use aligned sequences of words, called biphrases, as building blocks f -o those with bigrams. 5.2 Translation experiments with a bigram language model In this section we consider two real translation tasks, namely, translation from English to French, trained on Europarl (Koehn et al., 2003) and translation from German to Spanish training on the NewsCommentary corpus. For Europarl, the training set includes 2.81 million sentences, and the test set 500. For NewsCommentary the training set - -Koehn, Och, Marcu, 2003 -Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL 2003, pages 48–54, Morristown, NJ, USA. Association for Computational Linguistics. - - - -Adam Lopez - -Statistical machine translation -2008 -ACM Comput. Surv -40 - - in the candidate translation deviates from their order in the source sentence. Given such a model, where the �Z’s have been tuned on a development set in order to minimize some error rate (see e.g. (Lopez, 2008)), together with a library of biphrases extracted from some large training corpus, a decoder implements the actual search among alternative translations: (a*, T*) = arg max (a,T) The decoding problem - -Lopez, 2008 -Adam Lopez. 2008. Statistical machine translation. ACM Comput. Surv., 40(3):1–49. - - - -C Noon -J C Bean - -An efficient transformation of the generalized traveling salesman problem -1993 -INFOR -39--44 - -ned for STSP, but ATSP, SGTSP and AGTSP may be reduced to STSP, and therefore solved by STSP algorithms. 3.1 Reductions AGTSP—*ATSP—*STSP The transformation of the AGTSP into the ATSP, introduced by (Noon and Bean, 1993)), is illustrated in Figure (1). In this diagram, we assume that Y1, ... 
, YK are the nodes of a given cluster, while X and Z are arbitrary nodes belonging to other clusters. In the transformed graph, - -Noon, Bean, 1993 -C. Noon and J.C. Bean. 1993. An efficient transformation of the generalized traveling salesman problem. INFOR, pages 39–44. - - - -Kishore Papineni -Salim Roukos -Todd Ward -Wei J Zhu - -BLEU: a Method for Automatic Evaluation of Machine Translation -2001 -IBM Research Report, RC22176 - - procedure is to plot its value versus the elapsed time. The sec3 Both TSP decoders may be used with/or without a distortion limit; in our experiments we do not use this parameter. ond score is BLEU (Papineni et al., 2001), computed between the reconstructed and the original sentences, which allows us to check how well the quality of reconstruction correlates with the internal score. The training dataset for learning t - -Papineni, Roukos, Ward, Zhu, 2001 -Kishore Papineni, Salim Roukos, Todd Ward, and Wei J. Zhu. 2001. BLEU: a Method for Automatic Evaluation of Machine Translation. IBM Research Report, RC22176. - - - -Kris Popat -Daniel H Greene -Justin K Romberg -Dan S Bloomberg - -Adding linguistic constraints to document image decoding: Comparing the iterated complete path and stack algorithms -2001 - -escribed in section 4. 1, and then we recompute the optimal tour. Iterating this procedure provably converges to an optimal solution. This powerful method, which was proposed in (Kam and Kopec, 1996; Popat et al., 2001) in the context of a finite-state model (but not of TSP), can be easily extended to N-gram situations, and typically converges in a small number of iterations. 7 Conclusion The main contribution of th - -Popat, Greene, Romberg, Bloomberg, 2001 -Kris Popat, Daniel H. Greene, Justin K. Romberg, and Dan S. Bloomberg. 2001. Adding linguistic constraints to document image decoding: Comparing the iterated complete path and stack algorithms. 
- - - -Christoph Tillmann -Hermann Ney - -Word reordering and a dynamic programming beam search algorithm for statistical machine translation -2003 -Comput. Linguist -29 - -ng”. By employing generic IP techniques, it is however impossible to rely on the variety of more efficient both exact and approximate approaches which have been designed specifically for the TSP. In (Tillmann and Ney, 2003) and (Tillmann, 2006), the authors modify a certain Dynamic Programming technique used for TSP for use with an IBM- 4 word-based model and a phrase-based model respectively. However, to our knowledge, - Salesman Problem may be reformulated as an instance of a (simplistic) SMT decoding task, and while certain techniques used for TSP were then adapted to word-based SMT decoding (Germann et al., 2001; Tillmann and Ney, 2003; Tillmann, 2006), we are not aware of any previous work that shows that SMT decoding can be directly reformulated as a TSP. Beside the general interest of this transformation for understanding decodi - -Tillmann, Ney, 2003 -Christoph Tillmann and Hermann Ney. 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Comput. Linguist., 29(1):97–133. - - - -Christoph Tillmann - -Efficient Dynamic Programming Search Algorithms For Phrase-Based SMT -2006 -In Workshop On Computationally Hard Problems And Joint Inference In Speech And Language Processing - -techniques, it is however impossible to rely on the variety of more efficient both exact and approximate approaches which have been designed specifically for the TSP. In (Tillmann and Ney, 2003) and (Tillmann, 2006), the authors modify a certain Dynamic Programming technique used for TSP for use with an IBM- 4 word-based model and a phrase-based model respectively. 
However, to our knowledge, none of these works - reformulated as an instance of a (simplistic) SMT decoding task, and while certain techniques used for TSP were then adapted to word-based SMT decoding (Germann et al., 2001; Tillmann and Ney, 2003; Tillmann, 2006), we are not aware of any previous work that shows that SMT decoding can be directly reformulated as a TSP. Beside the general interest of this transformation for understanding decoding, it also opens - -Tillmann, 2006 -Christoph Tillmann. 2006. Efficient Dynamic Programming Search Algorithms For Phrase-Based SMT. In Workshop On Computationally Hard Problems And Joint Inference In Speech And Language Processing. - - - -Wikipedia - -Travelling Salesman Problem — Wikipedia, The Free Encyclopedia. [Online -2009 -accessed 5-May-2009 - -ding of some feasible tour of the AGTSP problem. As for the transformation ATSP—*STSP, several variants are described in the literature, e.g. (Applegate et al., 2007, p. 126); the one we use is from (Wikipedia, 2009) (not illustrated here for lack of space). 3.2 TSP algorithms TSP is one of the most studied problems in combinatorial optimization, and even a brief review of existing approaches would take too much - -Wikipedia, 2009 -Wikipedia. 2009. Travelling Salesman Problem — Wikipedia, The Free Encyclopedia. [Online; accessed 5-May-2009]. - - - -
\ No newline at end of file diff --git a/bin/34_1273675500_P09-1038.pdf.xml b/bin/34_1273675500_P09-1038.pdf.xml deleted file mode 100644 index 9355696..0000000 --- a/bin/34_1273675500_P09-1038.pdf.xml +++ /dev/null @@ -1,18977 +0,0 @@ - - - - - - - -en - - - - - - -
- - - - -Phrase-Based - -Statistical - -Machine - -Translation - -as - -a - -Traveling - - - -Salesman - - - - - -Problem - - - - - - - - - - -Mikhail - - - -Zaslavskiy - -* - - - - - -Marc - -Dymetman - - - -Nicola - -Cancedda - - - - - - - - - - - - - - -Mines - -ParisTech, - -Institut - -Curie - - - -Xerox - -Research - -Centre - -Europe - - - - - - - - - - - - - - -77305 - -Fontainebleau, - -France - - - -38240 - -Meylan, - -France - - - - - - - - - - - - - -mikhail.zaslavskiy@ensmp.fr - - - - -{ - -marc.dymetman,nicola.cancedda - -} - -@xrce.xerox.com - - - - - - - -
-
- - - -Abstract - - - - - - - -An - -efficient - -decoding - -algorithm - -is - -a - -cru- - - - -cial - -element - -of - -any - -statistical - -machine - - - - - - - -translation - -system. - -Some - -researchers - -have - - - - - - - -noted - -certain - -similarities - -between - -SMT - - - - - - -decoding - -and - -the - -famous - -Traveling - -Sales- - - - -man - -Problem; - -in - -particular - -(Knight, - -1999) - - - - - - - -has - -shown - -that - -any - -TSP - -instance - -can - -be - - - - - - - -mapped - -to - -a - -sub-case - -of - -a - -word-based - - - - - - - -SMT - -model, - -demonstrating - -NP-hardness - - - - - - -of - -the - -decoding - -task. - -In - -this - -paper, - -we - -fo- - - - -cus - -on - -the - -reverse - -mapping, - -showing - -that - - - - - - - -any - -phrase-based - -SMT - -decoding - -problem - - - - - - - -can - -be - -directly - -reformulated - -as - -a - -TSP. - -The - - - - - - - -transformation - -is - -very - -natural, - -deepens - -our - - - - - - - -understanding - -of - -the - -decoding - -problem, - - - - - - -and - -allows - -direct - -use - -of - -any - -of - -the - -pow- - - -erful - -existing - -TSP - -solvers - -for - -SMT - -de- - - - -coding. - -We - -test - -our - -approach - -on - -three - - - - - - -datasets, - -and - -compare - -a - -TSP-based - -de- - - -coder - -to - -the - -popular - -beam-search - -algo- - - - -rithm. - -In - -all - -cases, - -our - -method - -provides - - - - - - - -competitive - -or - -better - -performance. - - - - - - - - - -1 - -Introduction - - - - - - - - - -Phrase-based - -systems - -(Koehn - -et - -al., - -2003) - -are - - - - - - - -probably - -the - -most - -widespread - -class - -of - -Statistical - - - - - - - -Machine - -Translation - -systems, - -and - -arguably - -one - -of - - - - - - - -the - -most - -successful. 
- -They - -use - -aligned - -sequences - - - - - - - -of - -words, - -called - -biphrases, - -as - -building - -blocks - -for - - - - - - -translations, - -and - -score - -alternative - -candidate - -trans- - - - -lations - -for - -the - -same - -source - -sentence - -based - -on - -a - - - - - - - -log-linear - -model - -of - -the - -conditional - -probability - -of - - - - - - - -target - -sentences - -given - -the - -source - -sentence: - - - - - - - - - -p - -( - -T, - - - - - -a - -1 - -5 - -) - - - -= - -1 - - - - - - - - - - -Z - -S - - - - - -e - -xp - - - - - -1: - -A - -k - -h - -k - -( - -5, - - - -a, - - - -T - -) - - - - -(1) - - - - - - - -k - - - - - - - - -where - -the - - - - - -h - -k - - - -are - -features, - -that - -is, - -functions - -of - -the - - - - - - - -source - -string - - - - - -5 - -, - - - -of - -the - -target - -string - - - - - -T - -, - - - -and - -of - -the - - - - - - - - - - -* - - - -This - -work - -was - -conducted - -during - -an - -internship - -at - - - - - - -XRCE. - - - - - - - - - -alignment - - - - -a - -, - - - -where - -the - -alignment - -is - -a - -representa- - - - - -tion - -of - -the - -sequence - -of - -biphrases - -that - -where - -used - - - - - - - -in - -order - -to - -build - - - -T - - - -from - - - - - -5 - -; - - - -The - - - - - -� - -k - -’s - - - -are - -weights - - - - - - -and - - - - -Z - -S - - - -is - -a - -normalization - -factor - -that - -guarantees - - - - - - -that - - -p - - - -is - -a - -proper - -conditional - -probability - -distri- - - - - -bution - -over - -the - -pairs - - - - - -( - -T, - - - - - -A - -) - -. - - - -Some - -features - -are - - - - - - - -local - -, - - - -i.e. - -decompose - -over - -biphrases - -and - -can - -be - - - - - - -precomputed - -and - -stored - -in - -advance. 
- -These - -typ- - - -ically - -include - -forward - -and - -reverse - -phrase - -condi- - - - -tional - -probability - -features - - - -log - - - - - -p - -(� - -t - -1 - -s) - - - -as - -well - -as - - - - - - - -log - -p - -(s - -1 - -� - -t - -) - -, - - - -where - - - -9 - - - -is - -the - -source - -side - -of - -the - - - - - - - -biphrase - -and - - - - - -t - -� - - - -the - -target - -side, - -and - -the - -so-called - - - - - - - -“phrase - -penalty” - -and - -“word - -penalty” - -features, - - - - - - - -which - -count - -the - -number - -of - -phrases - -and - -words - -in - - - - - - - -the - -alignment. - -Other - -features - -are - - - - - -non-local - -, - - - -i.e. - - - - - - - -depend - -on - -the - -order - -in - -which - -biphrases - -appear - -in - - - - - - - -the - -alignment. - -Typical - -non-local - -features - -include - - - - - - - -one - -or - -more - -n-gram - -language - -models - -as - -well - -as - - - - - - - -a - -distortion - -feature, - -measuring - -by - -how - -much - -the - - - - - - -order - -of - -biphrases - -in - -the - -candidate - -translation - -de- - - - -viates - -from - -their - -order - -in - -the - -source - -sentence. - - - - - - - - - -Given - -such - -a - -model, - -where - -the - - - - - -� - -Z - -’s - - - -have - -been - - - - - - - -tuned - -on - -a - -development - -set - -in - -order - -to - -minimize - - - - - - - -some - -error - -rate - -(see - -e.g. - -(Lopez, - -2008)), - -together - - - - - - - -with - -a - -library - -of - -biphrases - -extracted - -from - -some - - - - - - - -large - -training - -corpus, - -a - - - -decoder - - - -implements - -the - - - - - - - -actual - -search - -among - -alternative - -translations: - - - - - - - - - -( - -a - -* - -, - - - - - -T - -* - -) - - - -= - -arg - -max - - - - - - - - - -( - -a,T - -) - - - - - - - - -The - -decoding - -problem - -(2) - -is - -a - -discrete - -optimiza- - - - -tion - -problem. 
- -Usually, - -it - -is - -very - -hard - -to - -find - -the - - - - - - -exact - -optimum - -and, - -therefore, - -an - -approximate - -so- - - - -lution - -is - -used. - -Currently, - -most - -decoders - -are - -based - - - - - - - -on - -some - -variant - -of - -a - -heuristic - -left-to-right - -search, - - - - - - - -that - -is, - -they - -attempt - -to - -build - -a - -candidate - -translation - - - - - - - -( - -a, - - - - - -T - -) - - - -incrementally, - -from - -left - -to - -right, - -extending - - - - - - - -the - -current - -partial - -translation - -at - -each - -step - -with - -a - - - - - - - -new - -biphrase, - -and - -computing - -a - -score - -composed - -of - - - - - - - -two - -contributions: - -one - -for - -the - -known - -elements - -of - - - - - - - -the - -partial - -translation - -so - -far, - -and - -one - -a - -heuristic - - - - - - - -
-
- - - - -P - -( - -T, - - - - - -a - -1 - -5 - -). - - - -(2) - - - - - - -
-
- - - -333 - - - - - - - - -Proceedings - -of - -the - -47th - -Annual - -Meeting - -of - -the - -ACL - -and - -the - -4th - -IJCNLP - -of - -the - - - -AFNLP - -, - - - -pages - -333–341, - - - - - - - - - -Suntec, - -Singapore, - -2-7 - -August - -2009. - - - -c - -� - -2009 - - - -ACL - -and - -AFNLP - - - - - - - -
- -
-
- - - - - - - -en - - - - - - - - -
- - - - -estimate - -of - -the - -remaining - -cost - -for - -completing - -the - - - - - - - -translation. - -The - -variant - -which - -is - -mostly - -used - -is - - - - - - - -a - -form - -of - - - - - -beam-search - -, - - - -where - -several - -partial - -can- - - - - -didates - -are - -maintained - -in - -parallel, - -and - -candidates - - - - - - - -for - -which - -the - -current - -score - -is - -too - -low - -are - -pruned - - - - - - - -in - -favor - -of - -candidates - -that - -are - -more - -promising. - - - - - - - - -We - -will - -see - -in - -the - -next - -section - -that - -some - -char- - - - -acteristics - -of - -beam-search - -make - -it - -a - -suboptimal - - - - - - - -choice - -for - -phrase-based - -decoding, - -and - -we - -will - - - - - - - -propose - -an - -alternative. - -This - -alternative - -is - -based - -on - - - - - - - -the - -observation - -that - -phrase-based - -decoding - -can - -be - - - - - - -very - -naturally - -cast - -as - -a - -Traveling - -Salesman - -Prob- - - - -lem - -(TSP), - -one - -of - -the - -best - -studied - -problems - -in - - - - - - - -combinatorial - -optimization. - -We - -will - -show - -that - -this - - - - - - -formulation - -is - -not - -only - -a - -powerful - -conceptual - -de- - - -vice - -for - -reasoning - -on - -decoding, - -but - -is - -also - -prac- - - - -tically - -convenient: - -in - -the - -same - -amount - -of - -time, - - - - - - - -off-the-shelf - -TSP - -solvers - -can - -find - -higher - -scoring - - - - - - -solutions - -than - -the - -state-of-the - -art - -beam-search - -de- - - - -coder - -implemented - -in - - - -Moses - - - -(Hoang - -and - -Koehn, - - - - - - -2008). - - - - - - - - -2 - -Related - -work - - - - - - - - - -Beam-search - -decoding - - - - - - - - - -In - -beam-search - -decoding, - -candidate - -translation - - - - - - - -prefixes - -are - -iteratively - -extended - -with - -new - -phrases. 
- - - - - - - -In - -its - -most - -widespread - -variant, - - - -stack - - - -decoding - -, - - - - - - - -prefixes - -obtained - -by - -consuming - -the - -same - -number - - - - - - -of - -source - -words, - -no - -matter - -which, - -are - -grouped - -to- - - - -gether - -in - -the - -same - - - - - -stack - -1 - - - -and - -compete - -against - -one - - - - - - -another. - - -Threshold - - - -and - - - -histogram - - - -pruning - -are - -ap- - - - - -plied: - -the - -former - -consists - -in - -dropping - -all - -prefixes - - - - - - - -having - -a - -score - -lesser - -than - -the - -best - -score - -by - -more - - - - - - -than - -some - -fixed - -amount - -(a - -parameter - -of - -the - -algo- - - - -rithm), - -the - -latter - -consists - -in - -dropping - -all - -prefixes - - - - - - - -below - -a - -certain - -rank. - - - - - - - - -While - -quite - -successful - -in - -practice, - -stack - -decod- - - - -ing - -presents - -some - -shortcomings. - -A - -first - -one - -is - -that - - - - - - - -prefixes - -obtained - -by - -translating - -different - -subsets - - - - - - - -of - -source - -words - -compete - -against - -one - -another. - -In - - - - - - - -one - -early - -formulation - -of - -stack - -decoding - -for - -SMT - - - - - - -(Germann - -et - -al., - -2001), - -the - -authors - -indeed - -pro- - - - -posed - -to - -lazily - -create - -one - -stack - -for - -each - -subset - - - - - - - -of - -source - -words, - -but - -acknowledged - -issues - -with - - - - - - -the - -potential - -combinatorial - -explosion - -in - -the - -num- - - - -ber - -of - -stacks. - -This - -problem - -is - -reduced - -by - -the - -use - - - - - - - -of - -heuristics - -for - -estimating - -the - -cost - -of - -translating - - - - - - - -the - -remaining - -part - -of - -the - -source - -sentence. 
- -How- - - - - - - - - - - -1 - -While - - - -commonly - -adopted - -in - -the - -speech - -and - -SMT - -com- - - - -munities, - -this - -is - -a - -bit - -of - -a - -misnomer, - -since - -the - -used - -data - -struc- - - - -tures - -are - -priority - -queues, - -not - -stacks. - - - - - - - - - - - -ever, - -this - -solution - -is - -only - -partially - -satisfactory. - -On - - - - - - - -the - -one - -hand, - -heuristics - -should - -be - -computationally - - - - - - - -light, - -much - -lighter - -than - -computing - -the - -actual - -best - - - - - - -score - -itself, - -while, - -on - -the - -other - -hand, - -the - -heuris- - - - -tics - -should - -be - -tight, - -as - -otherwise - -pruning - -errors - - - - - - - -will - -ensue. - -There - -is - -no - -clear - -criterion - -to - -guide - - - - - - - -in - -this - -trade-off. - -Even - -when - -good - -heuristics - -are - - - - - - - -available, - -the - -decoder - -will - -show - -a - -bias - -towards - - - - - - - -putting - -at - -the - -beginning - -the - -translation - -of - -a - -certain - - - - - - - -portion - -of - -the - -source, - -either - -because - -this - -portion - - - - - - - -is - -less - -ambiguous - -(i.e. - -its - -translation - -has - -larger - - - - - - - -conditional - -probability) - -or - -because - -the - -associated - - - - - - -heuristics - -is - -less - -tight, - -hence - -more - -optimistic. - -Fi- - - - -nally, - -since - -the - -translation - -is - -built - -left-to-right - -the - - - - - - -decoder - -cannot - -optimize - -the - -search - -by - -taking - -ad- - - - -vantage - -of - -highly - -unambiguous - -and - -informative - - - - - - - -portions - -that - -should - -be - -best - -translated - -far - -from - -the - - - - - - - -beginning. - -All - -these - -reasons - -motivate - -considering - - - - - - - -alternative - -decoding - -strategies. 
- - - - - - - - - -Word-based - -SMT - -and - -the - -TSP - - - - - - - - - -As - -already - -mentioned, - -the - -similarity - -between - - - - - - - -SMT - -decoding - -and - -TSP - -was - -recognized - -in - - - - - - - -(Knight, - -1999), - -who - -focussed - -on - -showing - -that - - - - - - - -any - -TSP - -can - -be - -reformulated - -as - -a - -sub-class - -of - -the - - - - - - -SMT - -decoding - -problem, - -proving - -that - -SMT - -decod- - - -ing - -is - -NP-hard. - -Following - -this - -work, - -the - -exis- - - -tence - -of - -many - -efficient - -TSP - -algorithms - -then - -in- - - -spired - -certain - -adaptations - -of - -the - -underlying - -tech- - - - -niques - -to - -SMT - -decoding - -for - -word-based - -models. - - - - - - -Thus, - -(Germann - -et - -al., - -2001) - -adapt - -a - -TSP - -sub- - - - -tour - -elimination - -strategy - -to - -an - -IBM-4 - -model, - -us- - - - -ing - -generic - -Integer - -Programming - -techniques. - -The - - - - - - -paper - -comes - -close - -to - -a - -TSP - -formulation - -of - -de- - - - -coding - -with - -IBM-4 - -models, - -but - -does - -not - -pursue - - - - - - - -this - -route - -to - -the - -end, - -stating - -that - - - -“It - -is - -difficult - - - - - - - -to - -convert - -decoding - -into - -straight - -TSP, - -but - -a - -wide - - - - - - -range - -of - -combinatorial - -optimization - -problems - -(in- - - -cluding - -TSP) - -can - -be - -expressed - -in - -the - -more - -gen- - - - -eral - -framework - -of - -linear - -integer - - - -programming” - -. - - - - - - -By - -employing - -generic - -IP - -techniques, - -it - -is - -how- - - - -ever - -impossible - -to - -rely - -on - -the - -variety - -of - -more - - - - - - - -efficient - -both - -exact - -and - -approximate - -approaches - - - - - - - -which - -have - -been - -designed - -specifically - -for - -the - -TSP. 
- - - - - - - -In - -(Tillmann - -and - -Ney, - -2003) - -and - -(Tillmann, - -2006), - - - - - - -the - -authors - -modify - -a - -certain - -Dynamic - -Program- - - -ming - -technique - -used - -for - -TSP - -for - -use - -with - -an - -IBM- - - - -4 - -word-based - -model - -and - -a - -phrase-based - -model - -re- - - - -spectively. - -However, - -to - -our - -knowledge, - -none - -of - - - - - - - -these - -works - -has - -proposed - -a - -direct - -reformulation - - - - - - -of - -these - -SMT - -models - -as - -TSP - -instances. - -We - -be- - - - -lieve - -we - -are - -the - -first - -to - -do - -so, - -working - -in - -our - -case - - - - - - - -
-
- - -334 - - - - - -
- -
-
- - - - - - - -en - - - - - - - - -
- - - - -with - -the - -mainstream - -phrase-based - -SMT - -models, - - - - - - - -and - -therefore - -making - -it - -possible - -to - -directly - -apply - - - - - - - -existing - -TSP - -solvers - -to - -SMT. - - - - - - - - - -3 - -The - -Traveling - -Salesman - -Problem - -and - - - - - - - -its - -variants - - - - - - - - -In - -this - -paper - -the - -Traveling - -Salesman - -Problem - -ap- - - - -pears - -in - -four - -variants: - - - - - - - - - -STSP - -. - - - -The - -most - -standard, - -and - -most - -studied, - - - - - - - -variant - -is - -the - - - -Symmetric - - - -TSP - -: - - - -we - -are - -given - -a - -non- - - - - - -directed - -graph - - - -G - - - -on - - - -N - - - -nodes, - -where - -the - -edges - - - - - - -carry - -real-valued - -costs. - -The - -STSP - -problem - -con- - - - -sists - -in - -finding - -a - -tour - -of - -minimal - -total - -cost, - -where - - - - - - -a - -tour - -(also - -called - -Hamiltonian - -Circuit) - -is - -a - -“cir- - - - -cular” - -sequence - -of - -nodes - -visiting - -each - -node - -of - -the - - - - - - - -graph - -exactly - -once; - - - - - - - - - -ATSP - -. - - - -The - - - -Asymmetric - - - -TSP - -, - - - -or - -ATSP, - -is - -a - -vari- - - - - -ant - -where - -the - -underlying - -graph - - - -G - - - -is - -directed - -and - - - - - - - -where, - -for - - - -i - - - -and - - - -j - - - -two - -nodes - -of - -the - -graph, - -the - - - - - - - -edges - - - -( - -i - -, - -j - -) - - - -and - - - -( - -j - -, - -i - -) - - - -may - -carry - -different - -costs. - - - - - - - - - -SGTSP - -. 
- - - -The - - - -Symmetric - -Generalized - - - -TSP - -, - - - -or - - - - - - - -SGTSP: - -given - -a - -non-oriented - -graph - - - -G - - - -of - - - - - -J - -G - -J - - - - - - - -nodes - -with - -edges - -carrying - -real-valued - -costs, - -given - - - - - - - -a - -partition - -of - -these - - - - - -J - -G - -J - - - -nodes - -into - - - -m - - - -non-empty, - - - - - - - -disjoint, - -subsets - -(called - -clusters), - -find - -a - -circular - - - - - - - -sequence - -of - - - -m - - - -nodes - -of - -minimal - -total - -cost, - -where - - - - - - - -each - -cluster - -is - -visited - -exactly - -once. - - - - - - - - - -AGTSP - -. - - - -The - - - -Asymmetric - -Generalized - - - -TSP - -, - - - -or - - - - - - - -AGTSP: - -similar - -to - -the - -SGTSP, - -but - - - -G - - - -is - -now - -a - -di- - - - - -rected - -graph. - - - - - - - - - -The - -STSP - -is - -often - -simply - -denoted - -TSP - -in - -the - - - - - - - -literature, - -and - -is - -known - -to - -be - -NP-hard - -(Applegate - - - - - - - -et - -al., - -2007); - -however - -there - -has - -been - -enormous - - - - - - - -interest - -in - -developing - -efficient - -solvers - -for - -it, - -both - - - - - - - -exact - -and - -approximate. - - - - - - - - - -Most - -of - -existing - -algorithms - -are - -designed - -for - - - - - - - -STSP - -, - - - -but - - - - - -ATSP - -, - - - -SGTSP - - - -and - - - -AGTSP - - - -may - -be - -re- - - - - -duced - -to - - - - - -STSP - -, - - - -and - -therefore - -solved - -by - - - -STSP - - - -al- - - - -gorithms. - - - - - - - - -3.1 - -Reductions - - - -AGTSP - -—* - -ATSP - -—* - -STSP - - - - - - - - - -The - -transformation - -of - -the - -AGTSP - -into - -the - -ATSP, - - - - - - -introduced - -by - -(Noon - -and - -Bean, - -1993)), - -is - -illus- - - - -trated - -in - -Figure - -(1). - -In - -this - -diagram, - -we - -assume - - - - - - -that - - - - -Y - -1 - -, - - - -... 
- -, - - - -Y - -K - - - -are - -the - -nodes - -of - -a - -given - -cluster, - - - - - - -while - - -X - - - -and - - - -Z - - - -are - -arbitrary - -nodes - -belonging - -to - - - - - - -other - -clusters. - -In - -the - -transformed - -graph, - -we - -in- - - - -troduce - -edges - -between - -the - - - - - -Y - -� - -’s - - - -in - -order - -to - -form - -a - - - - - - - -cycle - -as - -shown - -in - -the - -figure, - -where - -each - -edge - -has - - - - - - - -a - -large - -negative - -cost - - - - - -— - -K - -. - - - -We - -leave - -alone - -the - -in- - - - - -coming - -edge - -to - - - - - -Y - -� - - - -from - - - - - -X - -, - - - -but - -the - -outgoing - -edge - - - - - - - - - - - - - -Figure - -1: - - - -AGTSP - -—* - -ATSP. - - - - - - - - -from - - - - -Y - -� - - - -to - - - -X - - - -has - -its - -origin - -changed - -to - - - - - -Y - -� - -_ - -1 - -. - - - -A - - - - - - -feasible - -tour - -in - -the - -original - -AGTSP - -problem - -pass- - - - -ing - -through - - - -X, - - - -Y - -� - -, - - - -Z - - - -will - -then - -be - -“encoded” - -as - -a - - - - - - - -tour - -of - -the - -transformed - -graph - -that - -first - -traverses - - - - - - -X - - -, - -then - -traverses - - - - - -Y - -� - -, - - - -... - -, - - - -Y - -K - -, - - - -... - -, - - - -Y - -� - -_ - -1 - -, - - - -then - -tra- - - - -verses - - -Z - - - -(this - -encoding - -will - -have - -the - -same - -cost - -as - - - - - - - -the - -original - -cost, - -minus - - - - - -( - -k - - - -— - - - - - -1) - -K - -). 
-Crucially, if K is large enough, then the solver for the transformed ATSP graph will tend to traverse as many K edges as possible, meaning that it will traverse exactly k − 1 such edges in the cluster, that is, it will produce an encoding of some feasible tour of the AGTSP problem.
-As for the transformation ATSP → STSP, several variants are described in the literature, e.g. (Applegate et al., 2007, p. 126); the one we use is from (Wikipedia, 2009) (not illustrated here for lack of space).
-3.2 TSP algorithms
-TSP is one of the most studied problems in combinatorial optimization, and even a brief review of existing approaches would take too much space. Interested readers may consult (Applegate et al., 2007; Gutin, 2003) for good introductions.
-One of the best existing TSP solvers is implemented in the open source Concorde package (Applegate et al., 2005). Concorde includes the fastest exact algorithm and one of the most efficient implementations of the Lin-Kernighan (LK) heuristic for finding an approximate solution. LK works by generating an initial random feasible solution for the TSP problem, and then repeatedly identifying an ordered subset of k edges in the current tour and an ordered subset of k edges not included in the tour such that swapping them improves the objective function. This is somewhat reminiscent of the Greedy decoding of (Germann et al., 2001), but in LK several transformations can be applied simultaneously, so that the risk of being stuck in a local optimum is reduced (Applegate et al., 2007, chapter 15).
-As will be shown in the next section, phrase-based SMT decoding can be directly reformulated as an AGTSP. Here we use Concorde by first transforming AGTSP into STSP, but it might also be interesting in the future to use algorithms specifically designed for AGTSP, which could improve efficiency further (see Conclusion).
-4 Phrase-based Decoding as TSP
-In this section we reformulate the SMT decoding problem as an AGTSP. We will illustrate the approach through a simple example: translating the French sentence “cette traduction automatique est curieuse” into English.
-We assume that the relevant biphrases for translating the sentence are as follows:
-ID   source                    target
-h    cette                     this
-t    traduction                translation
-ht   cette traduction          this translation
-mt   traduction automatique    machine translation
-a    automatique               automatic
-m    automatique               machine
-i    est                       is
-s    curieuse                  strange
-c    curieuse                  curious
-Under this model, we can produce, among others, the following translations:
-h mt i s      this machine translation is strange
-h c t i a     this curious translation is automatic
-ht s i a      this translation strange is automatic
-where we have indicated on the left the ordered sequence of biphrases that leads to each translation.
-We now formulate decoding as an AGTSP, in the following way. The graph nodes are all the possible pairs (w, b), where w is a source word in the source sentence s and b is a biphrase containing this source word. The graph clusters are the subsets of the graph nodes that share a common source word w.
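The node and cluster construction can be made concrete with a small sketch. It is only an illustration under stated assumptions: the dictionary layout and the names `BIPHRASES`, `render`, and `build_clusters` are hypothetical, and membership of a word in a biphrase is tested by simple containment, which is adequate here only because the source words of the example sentence are all distinct.

```python
# Hypothetical encoding of the example table: ID -> (source words, target words).
BIPHRASES = {
    "h": (("cette",), ("this",)),
    "t": (("traduction",), ("translation",)),
    "ht": (("cette", "traduction"), ("this", "translation")),
    "mt": (("traduction", "automatique"), ("machine", "translation")),
    "a": (("automatique",), ("automatic",)),
    "m": (("automatique",), ("machine",)),
    "i": (("est",), ("is",)),
    "s": (("curieuse",), ("strange",)),
    "c": (("curieuse",), ("curious",)),
}
SOURCE = ["cette", "traduction", "automatique", "est", "curieuse"]


def render(sequence):
    """Concatenate the target sides of an ordered sequence of biphrase IDs."""
    return " ".join(word for b in sequence for word in BIPHRASES[b][1])


def build_clusters(source, biphrases):
    """Map each source word w to its cluster: all nodes (w, b) such that b covers w.

    Assumption: containment is enough because the source words are distinct.
    """
    return {
        w: [(w, b) for b, (src, _) in biphrases.items() if w in src]
        for w in source
    }


clusters = build_clusters(SOURCE, BIPHRASES)
nodes = [node for cluster in clusters.values() for node in cluster]
```

With this encoding, each source word yields one cluster, and a feasible tour must visit exactly one node per cluster.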
-The costs of a transition between nodes M and N of the graph are defined as follows:
-(a) If M is of the form (w, b) and N of the form (w', b), in which b is a single biphrase, and w and w' are consecutive words in b, then the transition cost is 0: once we commit to using the first word of b, there is no additional cost for traversing the other source words covered by b.
-(b) If M = (w, b), where w is the rightmost source word in the biphrase b, and N = (w', b'), where w' ≠ w is the leftmost source word in b', then the transition cost corresponds to the cost of selecting b' just after b; this will correspond to “consuming” the source side of b' after having consumed the source side of b (whatever their relative positions in the source sentence), and to producing the target side of b' directly after the target side of b; the transition cost is then the addition of several contributions (weighted by their respective λ (not shown), as in equation 1):
-  • The cost associated with the features local to b in the biphrase library;
-  • The “distortion” cost of consuming the source word w' just after the source word w: |pos(w') − pos(w) − 1|, where pos(w) and pos(w') are the positions of w and w' in the source sentence.
-  • The language model cost of producing the target words of b' right after the target words of b; with a bigram language model, this cost can be precomputed directly from b and b'. This restriction to bigram models will be removed in Section 4.1.
-(c) In all other cases, the transition cost is infinite, or, in other words, there is no edge in the graph between M and N.
-A special cluster containing a single node (denoted by $-$$ in the figures), and corresponding to special beginning-of-sentence symbols must also be included: the corresponding edges and weights can be worked out easily. Figures 2 and 3 give some illustrations of what we have just described.
-4.1 From Bigram to N-gram LM
-Successful phrase-based systems typically employ language models of order higher than two. However, our models so far have the following important “Markovian” property: the cost of a path is additive relative to the costs of transitions.
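The transition rules (a) to (c) and the additivity of path costs can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `SRC`, `POS`, `transition_cost`, and `path_cost` are hypothetical, the feature weights are all taken to be 1, the local and language-model costs are passed in as caller-supplied functions, and the beginning-of-sentence node is left out.

```python
import math

SOURCE = ["cette", "traduction", "automatique", "est", "curieuse"]
# Position lookup; valid only because the example source words are distinct.
POS = {w: i for i, w in enumerate(SOURCE)}

# Source sides of the example biphrases (IDs as in the biphrase table).
SRC = {
    "h": ["cette"], "t": ["traduction"], "ht": ["cette", "traduction"],
    "mt": ["traduction", "automatique"], "a": ["automatique"],
    "m": ["automatique"], "i": ["est"], "s": ["curieuse"], "c": ["curieuse"],
}


def transition_cost(m, n, local_cost, lm_cost):
    """Cost of the edge m -> n, where m = (w, b) and n = (w2, b2)."""
    (w, b), (w2, b2) = m, n
    if b == b2:
        # Case (a): free move between consecutive source words of one biphrase.
        k = SRC[b].index(w)
        consecutive = k + 1 < len(SRC[b]) and SRC[b][k + 1] == w2
        return 0.0 if consecutive else math.inf
    if w == SRC[b][-1] and w2 == SRC[b2][0] and w2 != w:
        # Case (b): leave b at its rightmost word, enter b2 at its leftmost word.
        distortion = abs(POS[w2] - POS[w] - 1)
        return local_cost(b) + distortion + lm_cost(b, b2)
    # Case (c): no edge.
    return math.inf


def path_cost(path, local_cost, lm_cost):
    """Additive ("Markovian") cost of a node sequence."""
    return sum(
        transition_cost(m, n, local_cost, lm_cost)
        for m, n in zip(path, path[1:])
    )
```

Because each edge cost depends only on its two endpoints, a bigram language model fits this scheme directly, whereas a trigram needs more context than a single edge provides.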
-For example, in the example of Figure 3, the cost of this • machine translation • is • strange can only take into account the conditional probability of the word strange relative to the word is, but not relative to the words translation and is. If we want to extend the power of the model to general n-gram language models, and in particular to the 3-gram case (on which we concentrate here, but the techniques can be easily extended to the general case), the following approach can be applied.
-Figure 2: Transition graph for the source sentence cette traduction automatique est curieuse. Only edges entering or exiting the node traduction − mt are shown. The only successor to [traduction − mt] is [automatique − mt], and [cette − ht] is not a predecessor of [traduction − mt].
-Figure 3: A GTSP tour is illustrated, corresponding to the displayed output.
-Compiling Out for Trigram models
-This approach consists in “compiling out” all biphrases with a target side of only one word. We replace each biphrase b with single-word target side by “extended” biphrases b_1, ..., b_r, which are “concatenations” of b and some other biphrase b' in the library.
-To give an example, consider that we: (1) remove from the biphrase library the biphrase i, which has a single word target, and (2) add to the library the extended biphrases mti, ti, si, ..., that is, all the extended biphrases consisting of the concatenation of a biphrase in the library with i; then it is clear that these extended biphrases will provide enough context to compute a trigram probability for the target word produced immediately next (in the examples, for the words strange, automatic and automatic respectively). If we do that exhaustively for all biphrases (relevant for the source sentence at hand) that, like i, have a single-word target, we will obtain a representation that allows a trigram language model to be computed at each point.
-The situation becomes clearer by looking at Figure 4, where we have only eliminated the biphrase i, and only shown some of the extended biphrases that now encapsulate i, and where we show one valid circuit. Note that we are now able to associate with the edge connecting the two nodes (est, mti) and (curieuse, s) a trigram cost because mti provides a large enough target context.
-Figure 4: Compiling-out of biphrase i: (est, is).
-Footnote 2: In the figures, such “concatenations” are denoted by [b' • b]; they are interpreted as encapsulations of first consuming the source side of b', whether or not this source side precedes the source side of b in the source sentence, producing the target side of b', consuming the source side of b, and producing the target side of b immediately after that of b'.
-While this exhaustive “compiling out” method works in principle, it has a serious defect: if for the sentence to be translated, there are m relevant biphrases, among which k have single-word targets, then we will create on the order of km extended biphrases, which may represent a significant overhead for the TSP solver, as soon as k is large relative to m, which is typically the case.
-The problem becomes even worse if we extend the compiling-out method to n-gram language models with n > 3. In the Future Work section below, we describe a powerful approach for circumventing this problem, but one with which we have not experimented yet.
-5 Experiments
-5.1 Monolingual word re-ordering
-[Figure 5 plots, panels (a)-(d): x axis Time (sec), log scale; curves BEAM-SEARCH and TSP]
-Figure 5: (a), (b): LM and BLEU scores as functions of time for a bigram LM; (c), (d): the same for a trigram LM. The x axis corresponds to the cumulative time for processing the test set; for (a) and (c), the y axis corresponds to the mean difference (over all sentences) between the LM score of the output and the LM score of the reference normalized by the sentence length N: (LM(ref)-LM(true))/N. The solid line with star marks corresponds to using beam-search with different pruning thresholds, which result in different processing times and performances. The cross corresponds to using the exact-TSP decoder (in this case the time to the optimal solution is not under the user’s control).
-In the first series of experiments we consider the artificial task of reconstructing the original word order of a given English sentence. First, we randomly permute words in the sentence, and then we try to reconstruct the original order by maximizing the LM score over all possible permutations. The reconstruction procedure may be seen as a translation problem from “Bad English” to “Good English”. Usually the LM score is used as one component of a more complex decoder score which also includes biphrase and distortion scores. But in this particular “translation task” from bad to good English, we consider that all “biphrases” are of the form e − e, where e is an English word, and we do not take into account any distortion: we only consider the quality of the permutation as it is measured by the LM component. Since for each “source word” e there is exactly one possible “biphrase” e − e, each cluster of the Generalized TSP representation of the decoding problem contains exactly one node; in other terms, the Generalized TSP in this situation is simply a standard TSP. Since the decoding phase is then equivalent to a word reordering, the LM score may be used to compare the performance of different decoding algorithms.
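The reordering task can be illustrated on a toy scale. The sketch below is an assumption-laden stand-in, not the experimental setup: it replaces the exact TSP solver with brute-force enumeration of permutations (feasible only for very short sentences), and the bigram table `LOGPROB`, the `<s>` start symbol, and the back-off value `floor` are invented for the illustration.

```python
import itertools
import math

# Hypothetical toy bigram log-probabilities; unseen bigrams get a floor value.
LOGPROB = {
    ("<s>", "this"): -0.5,
    ("this", "is"): -0.5,
    ("is", "strange"): -0.5,
}


def bigram_cost(prev, word, logprob, floor=-10.0):
    """Negative log-probability of `word` given `prev` (lower is better)."""
    return -logprob.get((prev, word), floor)


def best_order(words, logprob):
    """Exhaustively search all orderings and return the cheapest one.

    Each word is a one-node cluster, so this is a standard (asymmetric) TSP
    whose edge costs are bigram costs; enumeration stands in for the solver.
    """
    best, best_cost = None, math.inf
    for perm in itertools.permutations(words):
        seq = ("<s>",) + perm
        cost = sum(bigram_cost(a, b, logprob) for a, b in zip(seq, seq[1:]))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost


order, cost = best_order(["strange", "this", "is"], LOGPROB)
```

The search maximizes the LM score by minimizing the summed negative log-probabilities, which is the objective all compared decoders share in this experiment.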
-Here, we compare three different algorithms: classical beam-search (Moses); a decoder based on an exact TSP solver (Concorde); a decoder based on an approximate TSP solver (Lin-Kernighan as implemented in the Concorde solver) (see footnote 3). In the Beam-search and the LK-based TSP solver we can control the trade-off between approximation quality and running time. To measure re-ordering quality, we use two scores. The first one is just the “internal” LM score; since all three algorithms attempt to maximize this score, a natural evaluation procedure is to plot its value versus the elapsed time. The second score is BLEU (Papineni et al., 2001), computed between the reconstructed and the original sentences, which allows us to check how well the quality of reconstruction correlates with the internal score.
-Footnote 3: Both TSP decoders may be used with or without a distortion limit; in our experiments we do not use this parameter.
-The training dataset for learning the LM consists of 50000 sentences from the NewsCommentary corpus (Callison-Burch et al., 2008); the test dataset for word reordering consists of 170 sentences, with an average length of 17 words.
-Bigram based reordering. First we consider a bigram Language Model and the algorithms try to find the re-ordering that maximizes the LM score. The TSP solver used here is exact, that is, it actually finds the optimal tour. Figures 5(a,b) present the performance of the TSP and Beam-search based methods.
-Trigram based reordering. Then we consider a trigram based Language Model and the algorithms again try to maximize the LM score. The trigram model used is a variant of the exhaustive compiling-out procedure described in Section 4.1. Again, we use an exact TSP solver.
-Looking at Figure 5a, we see a somewhat surprising fact: the cross and some star points have positive y coordinates!
-This means that, when using a bigram language model, it is often possible to reorder the words of a randomly permuted reference sentence in such a way that the LM score of the reordered sentence is larger than the LM score of the reference. A second notable point is that the increase in the LM score of the beam-search with time is steady but very slow, and never reaches the level of performance obtained with the exact-TSP procedure, even when increasing the time by several orders of magnitude. Also to be noted is that the solution obtained by the exact-TSP is provably the optimum, which is almost never the case for the beam-search procedure. In Figure 5b, we report the BLEU score of the reordered sentences in the test set relative to the original reference sentences. Here we see that the exact-TSP outputs are closer to the references in terms of BLEU than the beam-search solutions. Although the TSP output does not recover the reference sentences (it produces sentences with a slightly higher LM score than the references), it does reconstruct the references better than the beam-search. The experiments with trigram language models (Figures 5(c,d)) show similar trends to those with bigrams.
-5.2 Translation experiments with a bigram language model
-In this section we consider two real translation tasks, namely, translation from English to French, trained on Europarl (Koehn et al., 2003), and translation from German to Spanish, trained on the NewsCommentary corpus. For Europarl, the training set includes 2.81 million sentences, and the test set 500.
-For NewsCommentary the training set is smaller: around 63k sentences, with a test set of 500 sentences. Figure 6 presents Decoder and BLEU scores as functions of time for the two corpora.
-Since in the real translation task the size of the TSP graph is much larger than in the artificial reordering task (in our experiments the median size of the TSP graph was around 400 nodes, sometimes growing up to 2000 nodes), directly applying the exact TSP solver would take too long; instead we use the approximate LK algorithm and compare it to Beam-Search. The efficiency of the LK algorithm can be significantly increased by using a good initialization. To compare the quality of the LK and Beam-Search methods we take a rough initial solution produced by the Beam-Search algorithm using a small value for the stack size and then use it as the initial point, both for the LK algorithm and for further Beam-Search optimization (where as before we vary the Beam-Search thresholds in order to trade quality for time).
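The idea of starting a local search from a rough initial solution can be sketched as below. This is deliberately not Lin-Kernighan: it is a much simpler pairwise-swap local search, used here only to show how a warm start is refined on an asymmetric cost matrix. The names `tour_cost` and `improve`, the toy matrix `COST`, and the initial tour are all assumptions made for the illustration.

```python
def tour_cost(tour, cost):
    """Total cost of a closed tour on an (asymmetric) cost matrix."""
    return sum(cost[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


def improve(tour, cost, max_passes=1000):
    """Refine an initial tour by swapping pairs of positions while it helps.

    Position 0 is kept fixed as the start node. Every accepted swap strictly
    lowers the cost, so the result is never worse than the initial tour.
    """
    tour = list(tour)
    best = tour_cost(tour, cost)
    for _ in range(max_passes):
        improved = False
        for i in range(1, len(tour)):
            for j in range(i + 1, len(tour)):
                tour[i], tour[j] = tour[j], tour[i]
                candidate = tour_cost(tour, cost)
                if candidate < best:
                    best, improved = candidate, True
                else:
                    tour[i], tour[j] = tour[j], tour[i]  # undo the swap
        if not improved:
            break
    return tour, best


# Toy asymmetric instance: going from i to i+1 (cyclically) is cheap.
COST = [[0 if i == j else (1 if j == (i + 1) % 4 else 5) for j in range(4)]
        for i in range(4)]
initial = [0, 2, 1, 3]  # stands in for a rough beam-search solution
refined, refined_cost = improve(initial, COST)
```

A better initial tour leaves the local search less work to do, which is the practical point of seeding it with a cheap beam-search result.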
-In the case of the Europarl corpus, we observe that LK outperforms Beam-Search in terms of the Decoder score as well as in terms of the BLEU score. Note that the difference between the two algorithms increases steeply at the beginning, which means that we can significantly increase the quality of the Beam-Search solution by using the LK algorithm at a very small price. In addition, it is important to note that the BLEU scores obtained in these experiments correspond to feature weights, in the log-linear model (1), that have been optimized for the Moses decoder, but not for the TSP decoder: optimizing these parameters relative to the TSP decoder could improve its BLEU scores still further.
-On the News corpus, again, LK outperforms Beam-Search in terms of the Decoder score. The situation with the BLEU score is less clear. Neither algorithm shows any clear score improvement with increasing running time, which suggests that the decoder’s objective function is not very well correlated with the BLEU score on this corpus.
-6 Future Work
-In section 4.1, we described a general “compiling out” method for extending our TSP representation to handle trigram and N-gram language models, but we noted that the method may lead to combinatorial explosion of the TSP graph. While this problem was manageable for the artificial monolingual word re-ordering (which had only one possible translation for each source word), it becomes unwieldy for the real translation experiments, which is why in this paper we only considered bigram LMs for these experiments. However, we know how to handle this problem in principle, and we now describe a method that we plan to experiment with in the future.
-To avoid the large number of artificial biphrases as in 4.1, we perform an adaptive selection. Let us suppose that (w, b) is a SMT decoding graph node, where b is a biphrase containing only one word on the target side. On the first step, when we evaluate the traveling cost from (w, b) to (w', b'), we take the language model component equal to
-    $\min_{b'' \neq b',\, b} \; -\log p(b'.v \mid b.e,\, b''.e)$,
-where b'.v represents the first word of the b' target side, b.e is the only word of the b target side, and b''.e is the last word of the b'' target side. This procedure underestimates the total cost of a tour passing through biphrases that have a single-word target. Therefore if the optimal tour passes only through biphrases with more than one
-
- -[Figure 6, panels (a)-(d): curves for BEAM-SEARCH and TSP (LK) plotted against Time (sec); plot data not recoverable, caption follows] - 
- - - - -(a) - -(b) - -(c) - -(d) - - - - - - - - -Figure - -6: - -(a), - -(b): - -Europarl - -corpus, - -translation - -from - -English - -to - -French; - -(c),(d): - -NewsCommentary - -cor- - - - -pus, - -translation - -from - -German - -to - -Spanish. - -Average - -value - -of - -the - -decoder - -and - -the - -BLEU - -scores - -(over - -500 - - - - - - - -test - -sentences) - -as - -a - -function - -of - -time. - -The - -trade-off - -quality/time - -in - -the - -case - -of - -LK - -is - -controlled - -by - -the - - - - - - - -number - -of - -iterations, - -and - -each - -point - -corresponds - -to - -a - -particular - -number - -of - -iterations, - -in - -our - -experiments - - - - - - - -LK - -was - -run - -with - -a - -number - -of - -iterations - -varying - -between - -2k - -and - -170k. - -The - -same - -trade-off - -in - -the - -case - -of - - - - - - - -Beam-Search - -is - -controlled - -by - -varying - -the - -beam - -thresholds. - - - - - - - -
-
- - - - -word - -on - -their - -target - -side, - -then - -we - -are - -sure - -that - - - - - - - -this - -tour - -is - -also - -optimal - -in - -terms - -of - -the - -tri-gram - - - - - - - -language - -model. - -Otherwise, - -if - -the - -optimal - -tour - - - - - - - - - -passes - -through - - - - - -( - -w, - - - - - -b - -) - -, - - - -where - - - -b - - - -is - -a - -biphrase - -hav- - - - - - - - - - -ing - -a - -single-word - -target, - -we - -add - -only - -the - -extended - - - - - - - -biphrases - -related - -to - - - -b - - - -as - -we - -described - -in - -section - - - - - - -4. - -1, - -and - -then - -we - -recompute - -the - -optimal - -tour. - -Iter- - - -ating - -this - -procedure - -provably - -converges - -to - -an - -op- - - - -timal - -solution. - - - - - - - - - -This - -powerful - -method, - -which - -was - -proposed - -in - - - - - - - -(Kam - -and - -Kopec, - -1996; - -Popat - -et - -al., - -2001) - -in - -the - - - - - - - -context - -of - -a - -finite-state - -model - -(but - -not - -of - -TSP), - - - - - - - -can - -be - -easily - -extended - -to - -N-gram - -situations, - -and - - - - - - -typically - -converges - -in - -a - -small - -number - -of - -itera- - - -tions. - - - - - - - - -7 - -Conclusion - - - - - - - - - -The - -main - -contribution - -of - -this - -paper - -has - -been - -to - - - - - - -propose - -a - -transformation - -for - -an - -arbitrary - -phrase- - - - - -based - -SMT - -decoding - -instance - -into - -a - -TSP - -instance. 
- - - - - - - -While - -certain - -similarities - -of - -SMT - -decoding - -and - - - - - - - -TSP - -were - -already - -pointed - -out - -in - -(Knight, - -1999), - - - - - - - -where - -it - -was - -shown - -that - -any - -Traveling - -Salesman - - - - - - - -Problem - -may - -be - -reformulated - -as - -an - -instance - -of - - - - - - -a - -(simplistic) - -SMT - -decoding - -task, - -and - -while - -cer- - - - -tain - -techniques - -used - -for - -TSP - -were - -then - -adapted - -to - - - - - - - -word-based - -SMT - -decoding - -(Germann - -et - -al., - -2001; - - - - - - - -Tillmann - -and - -Ney, - -2003; - -Tillmann, - -2006), - -we - -are - - - - - - - -not - -aware - -of - -any - -previous - -work - -that - -shows - -that - - - - - - - -SMT - -decoding - -can - -be - -directly - -reformulated - -as - -a - - - - - - -TSP. - -Beside - -the - -general - -interest - -of - -this - -transfor- - - - -mation - -for - -understanding - -decoding, - -it - -also - -opens - - - - - - -the - -door - -to - -direct - -application - -of - -the - -variety - -of - -ex- - - - -isting - -TSP - -algorithms - -to - -SMT. - -Our - -experiments - - - - - - -on - -synthetic - -and - -real - -data - -show - -that - -fast - -TSP - -al- - - - -gorithms - -can - -handle - -selection - -and - -reordering - -in - - - - - - - - - - -SMT - -comparably - -or - -better - -than - -the - -state-of-the- - - - - -art - -beam-search - -strategy, - -converging - -on - -solutions - - - - - - - -with - -higher - -objective - -function - -in - -a - -shorter - -time. - - - - - - - - -The - -proposed - -method - -proceeds - -by - -first - -con- - - - -structing - -an - -AGTSP - -instance - -from - -the - -decoding - - - - - - - -problem, - -and - -then - -converting - -this - -instance - -first - - - - - - - -into - -ATSP - -and - -finally - -into - -STSP. 
- -At - -this - -point, - -a - - - - - - - -direct - -application - -of - -the - -well - -known - -STSP - -solver - - - - - - -Concorde - - -(with - -Lin-Kernighan - -heuristic) - -already - - - - - - - -gives - -good - -results. - -We - -believe - -however - -that - -there - - - - - - -might - -exist - -even - -more - -efficient - -alternatives. - -In- - - - -stead - -of - -converting - -the - -AGTSP - -instance - -into - -a - - - - - - -STSP - -instance, - -it - -might - -prove - -better - -to - -use - -di- - - - -rectly - -algorithms - -expressly - -designed - -for - -ATSP - - - - - - - -or - -AGTSP. - -For - -instance, - -some - -of - -the - -algorithms - - - - - - - -tested - -in - -the - -context - -of - -the - - - -DIMACS - - - -implemen- - - - - -tation - -challenge - -for - -ATSP - -(Johnson - -et - -al., - -2002) - - - - - - -might - -well - -prove - -superior. - -There - -is - -also - -active - -re- - - - -search - -around - -AGTSP - -algorithms. - -Recently - -new - - - - - - - -effective - -methods - -based - -on - -a - -“memetic” - -strategy - - - - - - - -(Buriol - -et - -al., - -2004; - -Gutin - -et - -al., - -2008) - -have - -been - - - - - - - -put - -forward. - -These - -methods - -combined - -with - -our - - - - - - - -proposed - -formulation - -provide - -ready-to-use - -SMT - - - - - - - -decoders, - -which - -it - -will - -be - -interesting - -to - -compare. - - - - - - - - -Acknowledgments - - - - - - - - -Thanks - -to - -Vassilina - -Nikoulina - -for - -her - -advice - -about - - - - - - - -running - -Moses - -on - -the - -test - -datasets. - - - - - - - -
- - - -References - - - - - - - - -David - -L. - -Applegate, - -Robert - -E. - -Bixby, - -Vasek - -Chvatal, - - - - - - - -and - -William - -J. - -Cook. - -2005. - -Concorde - - - - - - - -tsp - -solver. - - - -http://www.tsp.gatech.edu/ - - - - - - - -concorde.html - -. - - - - - - - - - -David - -L. - -Applegate, - -Robert - -E. - -Bixby, - -Vasek - -Chvatal, - - - - - - - -and - -William - -J. - -Cook. - -2007. - - - -The - -Traveling - -Sales- - - - - -man - -Problem: - -A - -Computational - -Study - -(Princeton - - - - - - - -Series - -in - -Applied - - - -Mathematics) - -. - - - -Princeton - -Univer- - - - - -sity - -Press, - -January. - - - - - - - - - -Luciana - -Buriol, - -Paulo - -M. - - - -Franc - -¸ - -a, - - - -and - -Pablo - -Moscato. - - - - - - - -2004. - -A - -new - -memetic - -algorithm - -for - -the - -asymmetric - - - - - - - -traveling - -salesman - -problem. - - - -Journal - -of - - - -Heuristics - -, - - - - - - -10(5):483–506. - - - - - - - - -Chris - -Callison-Burch, - -Philipp - -Koehn, - -Christof - -Monz, - - - - - - -Josh - -Schroeder, - -and - -Cameron - -Shaw - -Fordyce, - -edi- - - - -tors. - -2008. - - - -Proceedings - -of - -the - -Third - -Workshop - -on - - - - - - - -SMT - -. - - - -ACL, - -Columbus, - -Ohio, - -June. - - - - - - - - - -Ulrich - -Germann, - -Michael - -Jahr, - -Kevin - -Knight, - -and - - - - - - - -Daniel - -Marcu. - -2001. - -Fast - -decoding - -and - -optimal - - - - - - - -decoding - -for - -machine - -translation. - -In - - - -In - -Proceedings - - - - - - - -ofACL - - - -39 - -, - - - -pages - -228–235. - - - - - - - - -Gregory - -Gutin, - -Daniel - -Karapetyan, - -and - -Krasnogor - -Na- - - - -talio. - -2008. - -Memetic - -algorithm - -for - -the - -generalized - - - - - - - -asymmetric - -traveling - -salesman - -problem. - -In - - - -NICSO - - - - - - - -2007 - -, - - - -pages - -199–210. - -Springer - -Berlin. - - - - - - - - -G. - -Gutin. - -2003. 
- -Travelling - -salesman - -and - -related - -prob- - - - -lems. - -In - - - -Handbook - -of - -Graph - - - -Theory - -. - - - - - - - - - -Hieu - -Hoang - -and - -Philipp - -Koehn. - -2008. - -Design - -of - -the - - - - - - - -Moses - -decoder - -for - -statistical - -machine - -translation. - -In - - - - - - - -ACL - -2008 - -Software - - - -workshop - -, - - - -pages - -58–65, - -Colum- - - - - -bus, - -Ohio, - -June. - -ACL. - - - - - - - - - -D.S. - -Johnson, - -G. - -Gutin, - -L.A. - -McGeoch, - -A. - -Yeo, - - - - - - -W. - -Zhang, - -and - -A. - -Zverovich. - -2002. - -Experimen- - - - -tal - -analysis - -of - -heuristics - -for - -the - -atsp. - -In - - - -The - -Trav- - - - - -elling - -Salesman - -Problem - -and - -Its - - - -Variations - -, - - - -pages - - - - - - -445–487. - - - - - - - - -Anthony - -C. - -Kam - -and - -Gary - -E. - -Kopec. - -1996. - -Document - - - - - - - -image - -decoding - -by - -heuristic - -search. - - - -IEEE - -Transac- - - - - -tions - -on - -Pattern - -Analysis - -and - -Machine - - - -Intelligence - -, - - - - - - -18:945–950. - - - - - - - -Kevin - -Knight. - -1999. - -Decoding - -complexity - -in - -word- - - - - -replacement - -translation - -models. - - - -Computational - - - - - - - -Linguistics - -, - - - -25:607–615. - - - - - - - - - -Philipp - -Koehn, - -Franz - -Josef - -Och, - -and - -Daniel - -Marcu. - - - - - - - -2003. - -Statistical - -phrase-based - -translation. - -In - - - - - - - -NAACL - - - -2003 - -, - - - -pages - -48–54, - -Morristown, - -NJ, - -USA. - - - - - - - -Association - -for - -Computational - -Linguistics. - - - - - - - - - -Adam - -Lopez. - -2008. - -Statistical - -machine - -translation. - - - - - - - -ACM - -Comput. - - - -Surv. - -, - - - -40(3):1–49. - - - - - - - - -C. - -Noon - -and - -J.C. - -Bean. - -1993. - -An - -efficient - -transforma- - - - -tion - -of - -the - -generalized - -traveling - -salesman - -problem. - - - - - - - -INFOR - -, - - - -pages - -39–44. 
- - - - - - - - - - - -Kishore - -Papineni, - -Salim - -Roukos, - -Todd - -Ward, - -and - - - - - - - -Wei - -J. - -Zhu. - -2001. - -BLEU: - -a - -Method - -for - -Automatic - - - - - - - -Evaluation - -of - -Machine - -Translation. - - - -IBM - -Research - - - - - - - -Report - -, - - - -RC22176. - - - - - - - - - -Kris - -Popat, - -Daniel - -H. - -Greene, - -Justin - -K. - -Romberg, - -and - - - - - - -Dan - -S. - -Bloomberg. - -2001. - -Adding - -linguistic - -con- - - - -straints - -to - -document - -image - -decoding: - -Comparing - - - - - - - -the - -iterated - -complete - -path - -and - -stack - -algorithms. - - - - - - - - -Christoph - -Tillmann - -and - -Hermann - -Ney. - -2003. - -Word - -re- - - - -ordering - -and - -a - -dynamic - -programming - -beam - -search - - - - - - - -algorithm - -for - -statistical - -machine - -translation. - - - -Com- - - - - -put. - - - -Linguist. - -, - - - -29(1):97–133. - - - - - - - - -Christoph - -Tillmann. - -2006. - -Efficient - -Dynamic - -Pro- - - - -gramming - -Search - -Algorithms - -For - -Phrase-Based - - - - - - - -SMT. - -In - - - -Workshop - -On - -Computationally - -Hard - -Prob- - - - - -lems - -And - -Joint - -Inference - -In - -Speech - -And - -Language - - - - - - - -Processing - -. - - - - - - - - -Wikipedia. - -2009. - -Travelling - -Salesman - -Problem - -— - - - -Wikipedia, - -The - -Free - -Encyclopedia. - -[Online; - -ac- - - - -cessed - -5-May-2009]. - - - - - - - -
diff --git a/bin/34_1273675500_P09-1038.txt b/bin/34_1273675500_P09-1038.txt deleted file mode 100644 index ae59c8b..0000000 --- a/bin/34_1273675500_P09-1038.txt +++ /dev/null @@ -1,844 +0,0 @@ -Phrase-Based Statistical Machine Translation as a Traveling Salesman -Problem -Mikhail Zaslavskiy* Marc Dymetman Nicola Cancedda - Mines ParisTech, Institut Curie Xerox Research Centre Europe - 77305 Fontainebleau, France 38240 Meylan, France - mikhail.zaslavskiy@ensmp.fr {marc.dymetman,nicola.cancedda}@xrce.xerox.com -Abstract -An efficient decoding algorithm is a cru- -cial element of any statistical machine -translation system. Some researchers have -noted certain similarities between SMT -decoding and the famous Traveling Sales- -man Problem; in particular (Knight, 1999) -has shown that any TSP instance can be -mapped to a sub-case of a word-based -SMT model, demonstrating NP-hardness -of the decoding task. In this paper, we fo- -cus on the reverse mapping, showing that -any phrase-based SMT decoding problem -can be directly reformulated as a TSP. The -transformation is very natural, deepens our -understanding of the decoding problem, -and allows direct use of any of the pow- -erful existing TSP solvers for SMT de- -coding. We test our approach on three -datasets, and compare a TSP-based de- -coder to the popular beam-search algo- -rithm. In all cases, our method provides -competitive or better performance. -1 Introduction -Phrase-based systems (Koehn et al., 2003) are -probably the most widespread class of Statistical -Machine Translation systems, and arguably one of -the most successful. 
They use aligned sequences -of words, called biphrases, as building blocks for -translations, and score alternative candidate translations -for the same source sentence based on a -log-linear model of the conditional probability of -target sentences given the source sentence: -p(T, a | S) = \frac{1}{Z_S} \exp \sum_k \lambda_k h_k(S, a, T) (1) -where the h_k are features, that is, functions of the -source string S, of the target string T, and of the -alignment a, where the alignment is a representation -of the sequence of biphrases that were used -in order to build T from S; the \lambda_k's are weights -and Z_S is a normalization factor that guarantees -that p is a proper conditional probability distribution -over the pairs (T, a). Some features are -local, i.e. decompose over biphrases and can be -precomputed and stored in advance. These typically -include forward and reverse phrase conditional -probability features log p(\tilde{t} | \tilde{s}) as well as -log p(\tilde{s} | \tilde{t}), where \tilde{s} is the source side of the -biphrase and \tilde{t} the target side, and the so-called -“phrase penalty” and “word penalty” features, -which count the number of phrases and words in -the alignment. Other features are non-local, i.e. -depend on the order in which biphrases appear in -the alignment. Typical non-local features include -one or more n-gram language models as well as -a distortion feature, measuring by how much the -order of biphrases in the candidate translation deviates -from their order in the source sentence. -* This work was conducted during an internship at -XRCE. -Given such a model, where the \lambda_k's have been -tuned on a development set in order to minimize -some error rate (see e.g. (Lopez, 2008)), together -with a library of biphrases extracted from some -large training corpus, a decoder implements the -actual search among alternative translations: -(a^*, T^*) = \arg\max_{(a, T)} p(T, a | S) (2) -The decoding problem (2) is a discrete optimization -problem. 
Usually, it is very hard to find the -exact optimum and, therefore, an approximate solution -is used. Currently, most decoders are based -on some variant of a heuristic left-to-right search, -that is, they attempt to build a candidate translation -(a, T) incrementally, from left to right, extending -the current partial translation at each step with a -new biphrase, and computing a score composed of -two contributions: one for the known elements of -the partial translation so far, and one for a heuristic -estimate of the remaining cost for completing the -translation. The variant which is mostly used is -a form of beam-search, where several partial candidates -are maintained in parallel, and candidates -for which the current score is too low are pruned -in favor of candidates that are more promising. -Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 333–341, -Suntec, Singapore, 2-7 August 2009. ©2009 ACL and AFNLP -We will see in the next section that some characteristics -of beam-search make it a suboptimal -choice for phrase-based decoding, and we will -propose an alternative. This alternative is based on -the observation that phrase-based decoding can be -very naturally cast as a Traveling Salesman Problem -(TSP), one of the best studied problems in -combinatorial optimization. We will show that this -formulation is not only a powerful conceptual device -for reasoning on decoding, but is also practically -convenient: in the same amount of time, -off-the-shelf TSP solvers can find higher scoring -solutions than the state-of-the-art beam-search decoder -implemented in Moses (Hoang and Koehn, -2008). -2 Related work -Beam-search decoding -In beam-search decoding, candidate translation -prefixes are iteratively extended with new phrases. 
-In its most widespread variant, stack decoding, -prefixes obtained by consuming the same number -of source words, no matter which, are grouped together -in the same stack [1] and compete against one -another. Threshold and histogram pruning are applied: -the former consists in dropping all prefixes -having a score lower than the best score by more -than some fixed amount (a parameter of the algorithm), -the latter consists in dropping all prefixes -below a certain rank. -[1] While commonly adopted in the speech and SMT communities, -this is a bit of a misnomer, since the data structures -used are priority queues, not stacks. -While quite successful in practice, stack decoding -presents some shortcomings. A first one is that -prefixes obtained by translating different subsets -of source words compete against one another. In -one early formulation of stack decoding for SMT -(Germann et al., 2001), the authors indeed proposed -to lazily create one stack for each subset -of source words, but acknowledged issues with -the potential combinatorial explosion in the number -of stacks. This problem is reduced by the use -of heuristics for estimating the cost of translating -the remaining part of the source sentence. However, -this solution is only partially satisfactory. On -the one hand, heuristics should be computationally -light, much lighter than computing the actual best -score itself, while, on the other hand, the heuristics -should be tight, as otherwise pruning errors -will ensue. There is no clear criterion to guide -this trade-off. Even when good heuristics are -available, the decoder will show a bias towards -putting at the beginning the translation of a certain -portion of the source, either because this portion -is less ambiguous (i.e. its translation has larger -conditional probability) or because the associated -heuristic is less tight, hence more optimistic. 
Finally, since the translation is built left-to-right the -decoder cannot optimize the search by taking advantage -of highly unambiguous and informative -portions that should be best translated far from the -beginning. All these reasons motivate considering -alternative decoding strategies. -Word-based SMT and the TSP -As already mentioned, the similarity between -SMT decoding and TSP was recognized in -(Knight, 1999), who focussed on showing that -any TSP can be reformulated as a sub-class of the -SMT decoding problem, proving that SMT decoding -is NP-hard. Following this work, the existence -of many efficient TSP algorithms then inspired -certain adaptations of the underlying techniques -to SMT decoding for word-based models. -Thus, (Germann et al., 2001) adapt a TSP subtour -elimination strategy to an IBM-4 model, using -generic Integer Programming techniques. The -paper comes close to a TSP formulation of decoding -with IBM-4 models, but does not pursue -this route to the end, stating that “It is difficult -to convert decoding into straight TSP, but a wide -range of combinatorial optimization problems (including -TSP) can be expressed in the more general -framework of linear integer programming”. -By employing generic IP techniques, it is however -impossible to rely on the variety of more -efficient both exact and approximate approaches -which have been designed specifically for the TSP. -In (Tillmann and Ney, 2003) and (Tillmann, 2006), -the authors modify a certain Dynamic Programming -technique used for TSP for use with an IBM-4 -word-based model and a phrase-based model respectively. -However, to our knowledge, none of -these works has proposed a direct reformulation -of these SMT models as TSP instances. We believe -we are the first to do so, working in our case -with the mainstream phrase-based SMT models, -and therefore making it possible to directly apply -existing TSP solvers to SMT. 
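
The threshold and histogram pruning used in stack decoding, described earlier in this section, can be sketched as follows. This is only an illustration under assumptions: the function name `prune`, the `(score, hypothesis)` tuple layout, and the parameter names are hypothetical and are not taken from Moses or from the paper.

```python
def prune(prefixes, threshold, max_rank):
    """Apply threshold pruning, then histogram pruning, to one stack.

    prefixes  -- list of (score, hypothesis) pairs; higher scores are better
    threshold -- drop prefixes scoring more than this amount below the best
    max_rank  -- keep at most this many prefixes (histogram pruning)
    All names here are illustrative assumptions.
    """
    if not prefixes:
        return []
    # Rank prefixes from best to worst score.
    ranked = sorted(prefixes, key=lambda p: p[0], reverse=True)
    best_score = ranked[0][0]
    # Threshold pruning: keep prefixes close enough to the best score.
    kept = [p for p in ranked if best_score - p[0] <= threshold]
    # Histogram pruning: keep only the top max_rank survivors.
    return kept[:max_rank]
```

Both parameters trade search quality for speed, which is the trade-off the section describes: tighter values run faster but risk discarding a prefix that would have led to the best translation.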
-3 The Traveling Salesman Problem and -its variants -In this paper the Traveling Salesman Problem appears -in four variants: -STSP. The most standard, and most studied, -variant is the Symmetric TSP: we are given a non-directed -graph G on N nodes, where the edges -carry real-valued costs. The STSP problem consists -in finding a tour of minimal total cost, where -a tour (also called Hamiltonian Circuit) is a “circular” -sequence of nodes visiting each node of the -graph exactly once; -ATSP. The Asymmetric TSP, or ATSP, is a variant -where the underlying graph G is directed and -where, for i and j two nodes of the graph, the -edges (i,j) and (j,i) may carry different costs. -SGTSP. The Symmetric Generalized TSP, or -SGTSP: given a non-oriented graph G of |G| -nodes with edges carrying real-valued costs, given -a partition of these |G| nodes into m non-empty, -disjoint, subsets (called clusters), find a circular -sequence of m nodes of minimal total cost, where -each cluster is visited exactly once. -AGTSP. The Asymmetric Generalized TSP, or -AGTSP: similar to the SGTSP, but G is now a directed -graph. -The STSP is often simply denoted TSP in the -literature, and is known to be NP-hard (Applegate -et al., 2007); however there has been enormous -interest in developing efficient solvers for it, both -exact and approximate. -Most existing algorithms are designed for -STSP, but ATSP, SGTSP and AGTSP may be reduced -to STSP, and therefore solved by STSP algorithms. -3.1 Reductions AGTSP→ATSP→STSP -The transformation of the AGTSP into the ATSP, -introduced by (Noon and Bean, 1993), is illustrated -in Figure 1. In this diagram, we assume -that Y_1, ... , Y_k are the nodes of a given cluster, -while X and Z are arbitrary nodes belonging to -other clusters. In the transformed graph, we introduce -edges between the Y_i's in order to form a -cycle as shown in the figure, where each edge has -a large negative cost −K. 
We leave alone the incoming -edge to Y_i from X, but the outgoing edge -from Y_i to X has its origin changed to Y_{i−1}. A -feasible tour in the original AGTSP problem passing -through X, Y_i, Z will then be “encoded” as a -tour of the transformed graph that first traverses -X, then traverses Y_i, ... , Y_k, ... , Y_{i−1}, then traverses -Z (this encoding will have the same cost as -the original cost, minus (k − 1)K). Crucially, if -K is large enough, then the solver for the transformed -ATSP graph will tend to traverse as many -K edges as possible, meaning that it will traverse -exactly k − 1 such edges in the cluster, that is, it -will produce an encoding of some feasible tour of -the AGTSP problem. -Figure 1: AGTSP→ATSP. -As for the transformation ATSP→STSP, several -variants are described in the literature, e.g. (Applegate -et al., 2007, p. 126); the one we use is from -(Wikipedia, 2009) (not illustrated here for lack of -space). -3.2 TSP algorithms -TSP is one of the most studied problems in combinatorial -optimization, and even a brief review of -existing approaches would take too much space. -Interested readers may consult (Applegate et al., -2007; Gutin, 2003) for good introductions. -One of the best existing TSP solvers is implemented -in the open source Concorde package (Applegate -et al., 2005). Concorde includes the fastest -exact algorithm and one of the most efficient implementations -of the Lin-Kernighan (LK) heuristic -for finding an approximate solution. LK works -by generating an initial random feasible solution -for the TSP problem, and then repeatedly identifying -an ordered subset of k edges in the current -tour and an ordered subset of k edges not included -in the tour such that when they are swapped the -objective function is improved. 
This is somewhat -reminiscent of the Greedy decoding of (Germann -et al., 2001), but in LK several transformations can -be applied simultaneously, so that the risk of being -stuck in a local optimum is reduced (Applegate et -al., 2007, chapter 15). -As will be shown in the next section, phrase-based -SMT decoding can be directly reformulated -as an AGTSP. Here we use Concorde through -first transforming AGTSP into STSP, but it might -also be interesting in the future to use algorithms -specifically designed for AGTSP, which could improve -efficiency further (see Conclusion). -4 Phrase-based Decoding as TSP -In this section we reformulate the SMT decoding -problem as an AGTSP. We will illustrate the approach -through a simple example: translating the -French sentence “cette traduction automatique est -curieuse” into English. We assume that the relevant -biphrases for translating the sentence are as -follows: -ID   source                   target -h    cette                    this -t    traduction               translation -ht   cette traduction         this translation -mt   traduction automatique   machine translation -a    automatique              automatic -m    automatique              machine -i    est                      is -s    curieuse                 strange -c    curieuse                 curious -Under this model, we can produce, among others, -the following translations: -h mt i s     this machine translation is strange -h c t i a    this curious translation is automatic -ht s i a     this translation strange is automatic -where we have indicated on the left the ordered sequence -of biphrases that leads to each translation. -We now formulate decoding as an AGTSP, in -the following way. The graph nodes are all the -possible pairs (w, b), where w is a source word in -the source sentence s and b is a biphrase containing -this source word. The graph clusters are the -subsets of the graph nodes that share a common -source word w. 
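
The node and cluster construction just described can be sketched on the running example. This is a minimal illustration under assumptions: the names `SOURCE`, `BIPHRASES`, and `build_clusters` are hypothetical, and source words are identified by their position in the sentence (an assumption made here so that repeated words would stay distinct).

```python
from collections import defaultdict

# Source sentence and biphrase library from the example (ID -> (source, target)).
SOURCE = ["cette", "traduction", "automatique", "est", "curieuse"]
BIPHRASES = {
    "h": ("cette", "this"),
    "t": ("traduction", "translation"),
    "ht": ("cette traduction", "this translation"),
    "mt": ("traduction automatique", "machine translation"),
    "a": ("automatique", "automatic"),
    "m": ("automatique", "machine"),
    "i": ("est", "is"),
    "s": ("curieuse", "strange"),
    "c": ("curieuse", "curious"),
}


def build_clusters(source, biphrases):
    """Return {source position: [(position, biphrase ID), ...]}.

    Each (position, ID) pair is one AGTSP node (w, b); each dictionary
    entry is one cluster, i.e. all nodes sharing the same source word.
    """
    clusters = defaultdict(list)
    for bid, (src, _tgt) in biphrases.items():
        src_words = src.split()
        n = len(src_words)
        # Find every place where the biphrase source side matches the sentence.
        for start in range(len(source) - n + 1):
            if source[start:start + n] == src_words:
                for pos in range(start, start + n):
                    clusters[pos].append((pos, bid))
    return dict(clusters)
```

On this example the sketch yields five clusters and eleven nodes; a tour that visits exactly one node per cluster selects one biphrase for every source word.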
-The costs of a transition between nodes M and -N of the graph are defined as follows: -(a) If M is of the form (w, b) and N of the form -(w', b), in which b is a single biphrase, and w and -w' are consecutive words in b, then the transition -cost is 0: once we commit to using the first word -of b, there is no additional cost for traversing the -other source words covered by b. -(b) If M = (w, b), where w is the rightmost -source word in the biphrase b, and N = (w', b'), -where w' ≠ w is the leftmost source word in b', -then the transition cost corresponds to the cost -of selecting b' just after b; this will correspond -to “consuming” the source side of b' after having -consumed the source side of b (whatever their relative -positions in the source sentence), and to producing -the target side of b' directly after the target -side of b; the transition cost is then the addition of -several contributions (weighted by their respective -\lambda (not shown), as in equation 1): -• The cost associated with the features local to -b in the biphrase library; -• The “distortion” cost of consuming the -source word w' just after the source word w: -|pos(w') − pos(w) − 1|, where pos(w) and -pos(w') are the positions of w and w' in the -source sentence. -• The language model cost of producing the -target words of b' right after the target words -of b; with a bigram language model, this cost -can be precomputed directly from b and b'. -This restriction to bigram models will be removed -in Section 4.1. -(c) In all other cases, the transition cost is infinite, -or, in other words, there is no edge in the graph -between M and N. -A special cluster containing a single node (denoted -by $-$$ in the figures), and corresponding to -special beginning-of-sentence symbols must also -be included: the corresponding edges and weights -can be worked out easily. Figures 2 and 3 give -some illustrations of what we have just described. 
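
Rules (a), (b) and (c) above can be sketched as a single cost function. This is only an illustration under assumptions: the `Node` layout, the `local_cost` table, the `lm_cost` callback and the two weights are hypothetical stand-ins for the feature values and weights that the text leaves unspecified, and the special beginning-of-sentence node is not handled.

```python
import math
from typing import Callable, Dict, NamedTuple


class Node(NamedTuple):
    pos: int    # position of the source word w in the sentence
    bid: str    # identifier of the biphrase b
    first: int  # position of the leftmost source word covered by b
    last: int   # position of the rightmost source word covered by b


def transition_cost(
    m: Node,
    n: Node,
    local_cost: Dict[str, float],
    lm_cost: Callable[[str, str], float],
    w_dist: float = 1.0,
    w_lm: float = 1.0,
) -> float:
    """Cost of the edge M -> N following rules (a), (b), (c)."""
    # (a) Consecutive source words inside the same biphrase occurrence.
    if m.bid == n.bid and m.first == n.first and n.pos == m.pos + 1:
        return 0.0
    # (b) Leave b at its rightmost word and enter b' at its leftmost word.
    if m.pos == m.last and n.pos == n.first and n.pos != m.pos:
        distortion = abs(n.pos - m.pos - 1)
        return (
            local_cost[m.bid]                  # features local to b
            + w_dist * distortion              # |pos(w') - pos(w) - 1|
            + w_lm * lm_cost(m.bid, n.bid)     # bigram LM cost of b' after b
        )
    # (c) No edge between M and N.
    return math.inf
```

Because every term in case (b) depends only on the pair (b, b'), the whole cost matrix can be filled in before the TSP solver is called, which is what the restriction to a bigram language model buys.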
-4.1 From Bigram to N-gram LM -Successful phrase-based systems typically employ -language models of order higher than two. However, -our models so far have the following important -“Markovian” property: the cost of a path is -additive relative to the costs of transitions. For -example, in the example of Figure 3, the cost of -this • machine translation • is • strange, can only -take into account the conditional probability of the -word strange relative to the word is, but not relative -to the words translation and is. If we want to -extend the power of the model to general n-gram -language models, and in particular to the 3-gram -case (on which we concentrate here, but the techniques -can be easily extended to the general case), -the following approach can be applied. -Figure 2: Transition graph for the source sentence -cette traduction automatique est curieuse. Only -edges entering or exiting the node traduction – mt -are shown. The only successor to [traduction – -mt] is [automatique – mt], and [cette – ht] is not a -predecessor of [traduction – mt]. -Figure 3: A GTSP tour is illustrated, corresponding -to the displayed output. -Compiling Out for Trigram models -This approach consists in “compiling out” all -biphrases with a target side of only one word. -We replace each biphrase b with single-word target -side by “extended” biphrases b_1, ... , b_r, which 
-are “concatenations” of b and some other biphrase -b' in the library [2]. To give an example, suppose -that we (1) remove from the biphrase library the -biphrase i, which has a single word target, and (2) -add to the library the extended biphrases mti, ti, -si, ..., that is, all the extended biphrases consisting -of the concatenation of a biphrase in the library -with i; it is then clear that these extended biphrases -will provide enough context to compute a trigram -probability for the target word produced immediately -next (in the examples, for the words strange, -automatic and automatic respectively). If we do -that exhaustively for all biphrases (relevant for the -source sentence at hand) that, like i, have a single-word -target, we will obtain a representation that -allows a trigram language model to be computed -at each point. -[2] In the figures, such “concatenations” are denoted by -[b' • b]; they are interpreted as encapsulations of first consuming -the source side of b', whether or not this source side -precedes the source side of b in the source sentence, producing -the target side of b', consuming the source side of b, and -producing the target side of b immediately after that of b'. -Figure 4: Compiling-out of biphrase i: (est,is). -The situation becomes clearer by looking at Figure 4, -where we have only eliminated the biphrase -i, and only shown some of the extended biphrases -that now encapsulate i, and where we show one -valid circuit. Note that we are now able to associate -with the edge connecting the two nodes -(est, mti) and (curieuse, s) a trigram cost because -mti provides a large enough target context. 
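
The compiling-out step for one single-word-target biphrase can be sketched as follows. This is a simplified illustration under assumptions: the function name `compile_out` is hypothetical, extended biphrases are stored as plain concatenated strings (which ignores that the two source sides need not be adjacent in the sentence), and candidates whose source side shares a word with b are skipped.

```python
# Biphrase library from the running example (ID -> (source, target)).
BIPHRASES = {
    "h": ("cette", "this"),
    "t": ("traduction", "translation"),
    "ht": ("cette traduction", "this translation"),
    "mt": ("traduction automatique", "machine translation"),
    "a": ("automatique", "automatic"),
    "m": ("automatique", "machine"),
    "i": ("est", "is"),
    "s": ("curieuse", "strange"),
    "c": ("curieuse", "curious"),
}


def compile_out(biphrases, bid):
    """Replace biphrase `bid` by its extended biphrases [b' . b].

    Returns a new library in which `bid` is removed and, for every other
    biphrase b' with a disjoint source side, the concatenation b' + b is added.
    """
    src_b, tgt_b = biphrases[bid]
    if len(tgt_b.split()) != 1:
        raise ValueError("only single-word-target biphrases are compiled out")
    extended = {}
    for other, (src_o, tgt_o) in biphrases.items():
        # Skip b itself and any b' covering the same source words.
        if other == bid or set(src_o.split()) & set(src_b.split()):
            continue
        extended[other + bid] = (src_o + " " + src_b, tgt_o + " " + tgt_b)
    library = {k: v for k, v in biphrases.items() if k != bid}
    library.update(extended)
    return library
```

Compiling out `i` alone already adds one extended biphrase per remaining library entry, which gives a feel for how quickly the library grows when this is repeated for every single-word-target biphrase.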
-While this exhaustive “compiling out” method -works in principle, it has a serious defect: if for -the sentence to be translated, there are m relevant -biphrases, among which k have single-word tar- -gets, then we will create on the order of k·m ex- -tended biphrases, which may represent a signif- -icant overhead for the TSP solver, as soon as k -is large relative to m, which is typically the case. -The problem becomes even worse if we extend the -compiling-out method to n-gram language models -with n > 3. In the Future Work section below, -we describe a powerful approach for circumvent- -ing this problem, but with which we have not ex- -perimented yet. -5 Experiments -5.1 Monolingual word re-ordering -[Figure 5, panels (a)-(d), each plotting BEAM-SEARCH and TSP against Time (sec); plots not reproduced.] -Figure 5: (a), (b): LM and BLEU scores as functions of time for a bigram LM; (c), (d): the same for -a trigram LM. The x axis corresponds to the cumulative time for processing the test set; for (a) and (c), -the y axis corresponds to the mean difference (over all sentences) between the lm score of the output -and the lm score of the reference normalized by the sentence length N: (LM(ref)-LM(true))/N. The solid -line with star marks corresponds to using beam-search with different pruning thresholds, which result in -different processing times and performances. The cross corresponds to using the exact-TSP decoder (in -this case the time to the optimal solution is not under the user’s control). -In the first series of experiments we consider the -artificial task of reconstructing the original word -order of a given English sentence. First, we ran- -domly permute words in the sentence, and then -we try to reconstruct the original order by max- -imizing the LM score over all possible permuta- -tions. The reconstruction procedure may be seen -as a translation problem from “Bad English” to -“Good English”. 
Usually the LM score is used -as one component of a more complex decoder -score which also includes biphrase and distortion -scores. But in this particular “translation task” -from bad to good English, we consider that all -“biphrases” are of the form e — e, where e is an -English word, and we do not take into account -any distortion: we only consider the quality of -the permutation as it is measured by the LM com- -ponent. Since for each “source word” e, there is -exactly one possible “biphrase” e — e each clus- -ter of the Generalized TSP representation of the -decoding problem contains exactly one node; in -other terms, the Generalized TSP in this situation -is simply a standard TSP. Since the decoding phase -is then equivalent to a word reordering, the LM -score may be used to compare the performance -of different decoding algorithms. Here, we com- -pare three different algorithms: classical beam- -search (Moses); a decoder based on an exact TSP -solver (Concorde); a decoder based on an approx- -imate TSP solver (Lin-Kernighan as implemented -in the Concorde solver) 3. In the Beam-search -and the LK-based TSP solver we can control the -trade-off between approximation quality and run- -ning time. To measure re-ordering quality, we use -two scores. The first one is just the “internal” LM -score; since all three algorithms attempt to maxi- -mize this score, a natural evaluation procedure is -to plot its value versus the elapsed time. The sec- -3 Both TSP decoders may be used with/or without a distor- -tion limit; in our experiments we do not use this parameter. -ond score is BLEU (Papineni et al., 2001), com- -puted between the reconstructed and the original -sentences, which allows us to check how well the -quality of reconstruction correlates with the inter- -nal score. 
The training dataset for learning the LM -consists of 50000 sentences from NewsCommen- -tary corpus (Callison-Burch et al., 2008), the test -dataset for word reordering consists of 170 sen- -tences, the average length of test sentences is equal -to 17 words. -Bigram based reordering. First we consider -a bigram Language Model and the algorithms try -to find the re-ordering that maximizes the LM -score. The TSP solver used here is exact, that is, -it actually finds the optimal tour. Figures 5(a,b) -present the performance of the TSP and Beam- -search based methods. -Trigram based reordering. Then we consider -a trigram based Language Model and the algo- -rithms again try to maximize the LM score. The -trigram model used is a variant of the exhaustive -compiling-out procedure described in Section 4.1. -Again, we use an exact TSP solver. -Looking at Figure 5a, we see a somewhat sur- -prising fact: the cross and some star points have -positive y coordinates! This means that, when us- -ing a bigram language model, it is often possible -to reorder the words of a randomly permuted ref- -erence sentence in such a way that the LM score -of the reordered sentence is larger than the LM of -the reference. A second notable point is that the -increase in the LM-score of the beam-search with -time is steady but very slow, and never reaches the -level of performance obtained with the exact-TSP -procedure, even when increasing the time by sev- -338 -eral orders of magnitude. Also to be noted is that -the solution obtained by the exact-TSP is provably -the optimum, which is almost never the case of -the beam-search procedure. In Figure 5b, we re- -port the BLEU score of the reordered sentences -in the test set relative to the original reference -sentences. Here we see that the exact-TSP out- -puts are closer to the references in terms of BLEU -than the beam-search solutions. 
Although the TSP -output does not recover the reference sentences -(it produces sentences with a slightly higher LM -score than the references), it does reconstruct the -references better than the beam-search. The ex- -periments with trigram language models (Figures -5(c,d)) show similar trends to those with bigrams. -5.2 Translation experiments with a bigram -language model -In this section we consider two real translation -tasks, namely, translation from English to French, -trained on Europarl (Koehn et al., 2003) and trans- -lation from German to Spanish training on the -NewsCommentary corpus. For Europarl, the train- -ing set includes 2.81 million sentences, and the -test set 500. For NewsCommentary the training -set is smaller: around 63k sentences, with a test -set of 500 sentences. Figure 6 presents Decoder -and Bleu scores as functions of time for the two -corpuses. -Since in the real translation task, the size of the -TSP graph is much larger than in the artificial re- -ordering task (in our experiments the median size -of the TSP graph was around 400 nodes, some- -times growing up to 2000 nodes), directly apply- -ing the exact TSP solver would take too long; in- -stead we use the approximate LK algorithm and -compare it to Beam-Search. The efficiency of the -LK algorithm can be significantly increased by us- -ing a good initialization. To compare the quality of -the LK and Beam-Search methods we take a rough -initial solution produced by the Beam-Search al- -gorithm using a small value for the stack size and -then use it as initial point, both for the LK algo- -rithm and for further Beam-Search optimization -(where as before we vary the Beam-Search thresh- -olds in order to trade quality for time). -In the case of the Europarl corpus, we observe -that LK outperforms Beam-Search in terms of the -Decoder score as well as in terms of the BLEU -score. 
Note that the difference between the two al- -gorithms increases steeply at the beginning, which -means that we can significantly increase the qual- -ity of the Beam-Search solution by using the LK -algorithm at a very small price. In addition, it is -important to note that the BLEU scores obtained in -these experiments correspond to feature weights, -in the log-linear model (1), that have been opti- -mized for the Moses decoder, but not for the TSP -decoder: optimizing these parameters relative to -the TSP decoder could improve its BLEU scores -still further. -On the News corpus, again, LK outperforms -Beam-Search in terms of the Decoder score. The -situation with the BLEU score is less clear. -Neither algorithm shows any clear score im- -provement with increasing running time, which -suggests that the decoder’s objective function is -not very well correlated with the BLEU score on -this corpus. -[Figure 6, panels (a)-(d), each plotting BEAM-SEARCH and TSP (LK) against Time (sec); plots not reproduced.] -Figure 6: (a), (b): Europarl corpus, translation from English to French; (c), (d): NewsCommentary cor- -pus, translation from German to Spanish. Average value of the decoder and the BLEU scores (over 500 -test sentences) as a function of time. The trade-off quality/time in the case of LK is controlled by the -number of iterations, and each point corresponds to a particular number of iterations; in our experiments -LK was run with a number of iterations varying between 2k and 170k. The same trade-off in the case of -Beam-Search is controlled by varying the beam thresholds. -6 Future Work -In section 4.1, we described a general “compiling -out” method for extending our TSP representation -to handle trigram and N-gram language models, -but we noted that the method may lead to combi- -natorial explosion of the TSP graph. While this -problem was manageable for the artificial mono- -lingual word re-ordering (which had only one pos- -sible translation for each source word), it be- -comes unwieldy for the real translation experi- -ments, which is why in this paper we only consid- -ered bigram LMs for these experiments. However, -we know how to handle this problem in principle, -and we now describe a method that we plan to ex- -periment with in the future. -To avoid the large number of artificial biphrases -as in 4.1, we perform an adaptive selection. Let us -suppose that (w, b) is an SMT decoding graph node, -where b is a biphrase containing only one word on -the target side. On the first step, when we evaluate -the traveling cost from (w, b) to (w', b'), we take -the language model component equal to -$\min_{b'' \neq b',\, b} \; -\log p(b'.v \mid b.e,\, b''.e)$, -where b'.v represents the first word of the b' tar- -get side, b.e is the only word of the b target -side, and b''.e is the last word of the b'' tar- -get side. This procedure underestimates the total -cost of tours passing through biphrases that have a -single-word target. Therefore if the optimal tour -passes only through biphrases with more than one -word on their target side, then we are sure that -this tour is also optimal in terms of the tri-gram -language model. Otherwise, if the optimal tour -passes through (w, b), where b is a biphrase hav- -ing a single-word target, we add only the extended -biphrases related to b as we described in section -4.1, and then we recompute the optimal tour. Iter- -ating this procedure provably converges to an op- -timal solution. -This powerful method, which was proposed in -(Kam and Kopec, 1996; Popat et al., 2001) in the -context of a finite-state model (but not of TSP), -can be easily extended to N-gram situations, and -typically converges in a small number of itera- -tions. 
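The adaptive selection described in this section can be sketched in Python as follows. This is only an outline under stated assumptions, not the authors' code, and the paper itself says the method had not been experimented with. The names `optimistic_lm_cost` and `adaptive_decode` are hypothetical, and so are the callables `trigram_logprob(w1, w2, w3)` (assumed to return log p(w3 | w1 w2)), `solve(library)` (assumed to return the optimal tour as a list of biphrase names under the optimistic costs) and `expand(library, b)` (assumed to replace b by its extended biphrases). The library maps a biphrase name to its tuple of target words.

```python
def optimistic_lm_cost(b, b_next, library, trigram_logprob):
    """Lower bound on the LM cost of going from single-word biphrase b to b_next.

    Minimises -log p(first word of b_next | last word of b'', only word of b)
    over every candidate predecessor b'' other than b and b_next.
    Raises ValueError if no such predecessor exists in the library.
    """
    only_word = library[b][0]
    first_next = library[b_next][0]
    return min(
        -trigram_logprob(library[prev][-1], only_word, first_next)
        for prev in library
        if prev not in (b, b_next)
    )


def adaptive_decode(library, solve, expand):
    """Solve, expand only the single-word biphrases the tour uses, and re-solve."""
    library = dict(library)
    while True:
        tour = solve(library)
        # Biphrases on the tour whose trigram context is still underestimated.
        offenders = [b for b in tour if len(library[b]) == 1]
        if not offenders:
            # Every biphrase on the tour has at least two target words,
            # so the optimistic costs were exact along this tour.
            return tour, library
        for b in offenders:
            library = expand(library, b)
```

If `expand` always removes b and adds only biphrases with at least two target words, the number of single-word biphrases drops on every pass, so the loop terminates.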
-7 Conclusion -The main contribution of this paper has been to -propose a transformation for an arbitrary phrase- -based SMT decoding instance into a TSP instance. -While certain similarities of SMT decoding and -TSP were already pointed out in (Knight, 1999), -where it was shown that any Traveling Salesman -Problem may be reformulated as an instance of -a (simplistic) SMT decoding task, and while cer- -tain techniques used for TSP were then adapted to -word-based SMT decoding (Germann et al., 2001; -Tillmann and Ney, 2003; Tillmann, 2006), we are -not aware of any previous work that shows that -SMT decoding can be directly reformulated as a -TSP. Beside the general interest of this transfor- -mation for understanding decoding, it also opens -the door to direct application of the variety of ex- -isting TSP algorithms to SMT. Our experiments -on synthetic and real data show that fast TSP al- -gorithms can handle selection and reordering in -SMT comparably or better than the state-of-the- -art beam-search strategy, converging on solutions -with higher objective function in a shorter time. -The proposed method proceeds by first con- -structing an AGTSP instance from the decoding -problem, and then converting this instance first -into ATSP and finally into STSP. At this point, a -direct application of the well known STSP solver -Concorde (with Lin-Kernighan heuristic) already -gives good results. We believe however that there -might exist even more efficient alternatives. In- -stead of converting the AGTSP instance into a -STSP instance, it might prove better to use di- -rectly algorithms expressly designed for ATSP -or AGTSP. For instance, some of the algorithms -tested in the context of the DIMACS implemen- -tation challenge for ATSP (Johnson et al., 2002) -might well prove superior. There is also active re- -search around AGTSP algorithms. Recently new -effective methods based on a “memetic” strategy -(Buriol et al., 2004; Gutin et al., 2008) have been -put forward. 
These methods combined with our -proposed formulation provide ready-to-use SMT -decoders, which it will be interesting to compare. -Acknowledgments -Thanks to Vassilina Nikoulina for her advice about -running Moses on the test datasets. -340 -References -David L. Applegate, Robert E. Bixby, Vasek Chvatal, -and William J. Cook. 2005. Concorde -tsp solver. http://www.tsp.gatech.edu/ -concorde.html. -David L. Applegate, Robert E. Bixby, Vasek Chvatal, -and William J. Cook. 2007. The Traveling Sales- -man Problem: A Computational Study (Princeton -Series in Applied Mathematics). Princeton Univer- -sity Press, January. -Luciana Buriol, Paulo M. Franc¸a, and Pablo Moscato. -2004. A new memetic algorithm for the asymmetric -traveling salesman problem. Journal of Heuristics, -10(5):483–506. -Chris Callison-Burch, Philipp Koehn, Christof Monz, -Josh Schroeder, and Cameron Shaw Fordyce, edi- -tors. 2008. Proceedings of the Third Workshop on -SMT. ACL, Columbus, Ohio, June. -Ulrich Germann, Michael Jahr, Kevin Knight, and -Daniel Marcu. 2001. Fast decoding and optimal -decoding for machine translation. In In Proceedings -ofACL 39, pages 228–235. -Gregory Gutin, Daniel Karapetyan, and Krasnogor Na- -talio. 2008. Memetic algorithm for the generalized -asymmetric traveling salesman problem. In NICSO -2007, pages 199–210. Springer Berlin. -G. Gutin. 2003. Travelling salesman and related prob- -lems. In Handbook of Graph Theory. -Hieu Hoang and Philipp Koehn. 2008. Design of the -Moses decoder for statistical machine translation. In -ACL 2008 Software workshop, pages 58–65, Colum- -bus, Ohio, June. ACL. -D.S. Johnson, G. Gutin, L.A. McGeoch, A. Yeo, -W. Zhang, and A. Zverovich. 2002. Experimen- -tal analysis of heuristics for the atsp. In The Trav- -elling Salesman Problem and Its Variations, pages -445–487. -Anthony C. Kam and Gary E. Kopec. 1996. Document -image decoding by heuristic search. IEEE Transac- -tions on Pattern Analysis and Machine Intelligence, -18:945–950. 
-Kevin Knight. 1999. Decoding complexity in word- -replacement translation models. Computational -Linguistics, 25:607–615. -Philipp Koehn, Franz Josef Och, and Daniel Marcu. -2003. Statistical phrase-based translation. In -NAACL 2003, pages 48–54, Morristown, NJ, USA. -Association for Computational Linguistics. -Adam Lopez. 2008. Statistical machine translation. -ACM Comput. Surv., 40(3):1–49. -C. Noon and J.C. Bean. 1993. An efficient transforma- -tion of the generalized traveling salesman problem. -INFOR, pages 39–44. -Kishore Papineni, Salim Roukos, Todd Ward, and -Wei J. Zhu. 2001. BLEU: a Method for Automatic -Evaluation of Machine Translation. IBM Research -Report, RC22176. -Kris Popat, Daniel H. Greene, Justin K. Romberg, and -Dan S. Bloomberg. 2001. Adding linguistic con- -straints to document image decoding: Comparing -the iterated complete path and stack algorithms. -Christoph Tillmann and Hermann Ney. 2003. Word re- -ordering and a dynamic programming beam search -algorithm for statistical machine translation. Com- -put. Linguist., 29(1):97–133. -Christoph Tillmann. 2006. Efficient Dynamic Pro- -gramming Search Algorithms For Phrase-Based -SMT. In Workshop On Computationally Hard Prob- -lems And Joint Inference In Speech And Language -Processing. -Wikipedia. 2009. Travelling Salesman Problem — -Wikipedia, The Free Encyclopedia. [Online; ac- -cessed 5-May-2009]. -341 diff --git a/bin/BiblioScript/README.md b/bin/BiblioScript/README.md new file mode 100644 index 0000000..c9ad0aa --- /dev/null +++ b/bin/BiblioScript/README.md @@ -0,0 +1,26 @@ +BiblioScript +============ + +## Dependencies ## +(For the installation of dependencies see project's websites) + +* BibUtils: v. 4.8 +* ParsCit: v. 
100401d +* Saxon He 9.2.1.2 + +## Installation ## + +After cloning the git repo, change the paths to the executables in biblio_script.sh according to your local settings: + + # paths to executables + PARSCIT_PATH="/Applications/ParsCit/bin/" + BIBUTILS_PATH="/Applications/bibutils_4.8/" + SAXON_PATH="/Applications/saxonhe9-2-1-2j/saxon9he.jar" + +## Usage ## + +You can try out the script using the provided input_sample.txt file: + ./biblio_script.sh + +## Caveat ## +The accuracy of the resulting .bib file depends on the ParsCit engine used to parse the unstructured (plain text) bibliography, and may vary from version to version. \ No newline at end of file diff --git a/bin/BiblioScript/biblio_script.sh b/bin/BiblioScript/biblio_script.sh new file mode 100755 index 0000000..31ddc1d --- /dev/null +++ b/bin/BiblioScript/biblio_script.sh @@ -0,0 +1,139 @@ +#!/usr/bin/env python +# Author: Matteo Romanello, + +import os,sys,getopt,re + +# paths to executables +# Thang v100901: minor modifications in the code so that it doesn't matter if the below directory paths end with / or not +PARSCIT_PATH=sys.path[0] + "/../../bin/" +BIBUTILS_PATH=sys.path[0] + "/bibutils_4.10" +SAXON_PATH=sys.path[0] + "/saxonhe9-2-1-2j/saxon9he.jar" + +# paths to resources +XSLT_TRANFORM_PATH=sys.path[0] + "/parscit2mods.xsl" + +def parscit_to_mods(parscit_out, is_quiet): + saxon_cmd="java -jar %s -xsl:%s -s:%s" %(SAXON_PATH,XSLT_TRANFORM_PATH,parscit_out) + out=os.popen(saxon_cmd).readlines() + if is_quiet == "no": + print "Transforming Parscit's output into mods xml..." 
+ return out + +def export_mods(mods_xml, out_type, is_quiet): + bibutils_cmd="%s/xml2%s %s"%(BIBUTILS_PATH, out_type, mods_xml) # Thang v100901: modify to add multiple export format + + if is_quiet == "yes": bibutils_cmd = "%s 2>/dev/null" %(bibutils_cmd) + out=os.popen(bibutils_cmd).readlines() + return out + +def usage(): + print "Usage: %s [-h] [-q] [-i ] [-o ] " %(sys.argv[0]) + print "Options:" + print "\t-h\tPrint this message" + print "\t-q\tDo not print log messages" + print "\t-i \tType=\"all\" (full-text input),\"ref\" (input contains only individual reference strings, one per line), \"xml\" (Omnipage XML input), \"parscit\" (ParsCit citation output), or \"mods\" (MODS file) (default=\"ref\")" + print "\t-o \tType=(ads|bib|end|isi|ris|wordbib) (default=bib)" + +# Thang v100901: process argv array using getopt +def process_argv(argv): + try: + opts, args = getopt.getopt(argv[1:], "hqi:o:", ["help", "quiet", "input=", "output="]) + except getopt.GetoptError, err: + print str(err) + usage() + sys.exit(2) + + in_type = "ref" + out_type = "bib" + is_quiet = "no" + + for o, a in opts: + if o in ("-h", "--help"): + usage() + sys.exit() + elif o in ("-q", "--quiet"): + is_quiet = "yes" + elif o in ("-i", "--input"): + in_type = a + # anchor the pattern so that only the exact type names are accepted + if(not re.match("(all|ref|xml|parscit|mods)$", in_type)): + sys.stderr.write("#! in_type \"%s\" does not match (all|ref|xml|parscit|mods)\n" % in_type) + sys.exit(1) + elif o in ("-o", "--output"): + out_type = a + + # anchor the pattern: each type must name an existing xml2<type> BibUtils binary + if(not re.match("(ads|bib|end|isi|ris|wordbib)$", out_type)): + sys.stderr.write("#! 
Output type \"%s\" does not match (ads|bib|end|isi|ris|wordbib)\n" % out_type) + sys.exit(1) + + else: + assert False, "unhandled option" + + # get inp_file, out_dir & check validity + inp_file = "" + out_dir = "" + if(len(args) > 1): + inp_file = args[0] + out_dir = args[1] + else: + usage() + sys.exit(1) + + if is_quiet == "no": sys.stderr.write("# (in_type, outputType, inputFile, outDir) = (\"%s\", \"%s\", \"%s\", \"%s\")\n" %(in_type, out_type, inp_file, out_dir)) + + # check if the input file exists + if not os.path.isfile(inp_file): + sys.stderr.write("#! File \"%s\" doesn't exist\n" % inp_file) + sys.exit(1) + + # check if directory exists, create if not: + if not os.path.exists(out_dir): + if is_quiet == "no": sys.stderr.write("#! Directory \"%s\" doesn't exist. Creating ...\n" % out_dir) + os.makedirs(out_dir) + + return (out_type, in_type, inp_file, out_dir, is_quiet) +# End Thang v100901: process argv array + +############ +### MAIN ### +############ +(out_type, in_type, inp_file, out_dir, is_quiet) = process_argv(sys.argv) +if is_quiet == "no": print "# Extracting references from the input file... 
" + +# Thang v100901: handle in_type +if (in_type == "ref"): + parscit_out = os.popen("%s/parseRefStrings.pl %s" %(PARSCIT_PATH,inp_file)).readlines() +elif(in_type == "all"): + parscit_out = os.popen("%s/citeExtract.pl -m extract_citations %s" %(PARSCIT_PATH,inp_file)).readlines() +elif(in_type == "xml"): + parscit_out = os.popen("%s/citeExtract.pl -m extract_citations -i xml %s" %(PARSCIT_PATH,inp_file)).readlines() + +if(in_type != "mods" and in_type != "parscit"): + parscit_xml='%s/parscit_temp.xml'%out_dir + file = open(parscit_xml,'w') + for line in parscit_out: + file.write(line) + file.close() +elif(in_type == "parscit"): + parscit_xml = inp_file + + +# transform parscit's output into mods 3.x +if(in_type != "mods"): + parscit_mods='%s/parscit_mods.xml'%out_dir + file = open(parscit_mods,'w') + for line in parscit_to_mods(parscit_xml, is_quiet): + file.write(line) + file.close() +else: # already an MODS file, copy over + parscit_mods = inp_file + +# transform mods intermediate xml into other export format +# Thang v100901: modify to handle multiple format +export_file='%s/parscit.%s' %(out_dir, out_type) +if is_quiet == "no": print "# Transforming intermediate mods xml into %s format. Output to %s ..." 
% (out_type, export_file) + +file = open(export_file,'w') +for line in export_mods(parscit_mods, out_type, is_quiet): + file.write(line) +file.close() + diff --git a/bin/BiblioScript/bibutils_4.10/bib2xml b/bin/BiblioScript/bibutils_4.10/bib2xml new file mode 100755 index 0000000..105246a Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/bib2xml differ diff --git a/bin/BiblioScript/bibutils_4.10/biblatex2xml b/bin/BiblioScript/bibutils_4.10/biblatex2xml new file mode 100755 index 0000000..67a5829 Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/biblatex2xml differ diff --git a/bin/BiblioScript/bibutils_4.10/copac2xml b/bin/BiblioScript/bibutils_4.10/copac2xml new file mode 100755 index 0000000..5ded976 Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/copac2xml differ diff --git a/bin/BiblioScript/bibutils_4.10/ebi2xml b/bin/BiblioScript/bibutils_4.10/ebi2xml new file mode 100755 index 0000000..e298326 Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/ebi2xml differ diff --git a/bin/BiblioScript/bibutils_4.10/end2xml b/bin/BiblioScript/bibutils_4.10/end2xml new file mode 100755 index 0000000..c73b04f Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/end2xml differ diff --git a/bin/BiblioScript/bibutils_4.10/endx2xml b/bin/BiblioScript/bibutils_4.10/endx2xml new file mode 100755 index 0000000..6e7b811 Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/endx2xml differ diff --git a/bin/BiblioScript/bibutils_4.10/isi2xml b/bin/BiblioScript/bibutils_4.10/isi2xml new file mode 100755 index 0000000..746e77a Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/isi2xml differ diff --git a/bin/BiblioScript/bibutils_4.10/med2xml b/bin/BiblioScript/bibutils_4.10/med2xml new file mode 100755 index 0000000..52ef963 Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/med2xml differ diff --git a/bin/BiblioScript/bibutils_4.10/modsclean b/bin/BiblioScript/bibutils_4.10/modsclean new file mode 100755 index 
0000000..d4746f6 Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/modsclean differ diff --git a/bin/BiblioScript/bibutils_4.10/ris2xml b/bin/BiblioScript/bibutils_4.10/ris2xml new file mode 100755 index 0000000..7fa5b6b Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/ris2xml differ diff --git a/bin/BiblioScript/bibutils_4.10/wordbib2xml b/bin/BiblioScript/bibutils_4.10/wordbib2xml new file mode 100755 index 0000000..fb84916 Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/wordbib2xml differ diff --git a/bin/BiblioScript/bibutils_4.10/xml2ads b/bin/BiblioScript/bibutils_4.10/xml2ads new file mode 100755 index 0000000..10939f8 Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/xml2ads differ diff --git a/bin/BiblioScript/bibutils_4.10/xml2bib b/bin/BiblioScript/bibutils_4.10/xml2bib new file mode 100755 index 0000000..8e7e9e1 Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/xml2bib differ diff --git a/bin/BiblioScript/bibutils_4.10/xml2end b/bin/BiblioScript/bibutils_4.10/xml2end new file mode 100755 index 0000000..8e26567 Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/xml2end differ diff --git a/bin/BiblioScript/bibutils_4.10/xml2isi b/bin/BiblioScript/bibutils_4.10/xml2isi new file mode 100755 index 0000000..0b81c48 Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/xml2isi differ diff --git a/bin/BiblioScript/bibutils_4.10/xml2ris b/bin/BiblioScript/bibutils_4.10/xml2ris new file mode 100755 index 0000000..9b6474d Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/xml2ris differ diff --git a/bin/BiblioScript/bibutils_4.10/xml2wordbib b/bin/BiblioScript/bibutils_4.10/xml2wordbib new file mode 100755 index 0000000..a1a2d4c Binary files /dev/null and b/bin/BiblioScript/bibutils_4.10/xml2wordbib differ diff --git a/bin/BiblioScript/git_commit.sh b/bin/BiblioScript/git_commit.sh new file mode 100755 index 0000000..f63ffab --- /dev/null +++ b/bin/BiblioScript/git_commit.sh @@ -0,0 +1,38 
@@ +#!/bin/sh + +# Author: Luong Minh Thang , generated at Sun, 01 Jun 2008 15:21:09 + +date + +projName="origin" +branch="master" +if [ $# -gt "0" ] +then + comment=$1 + if [ $# -gt "1" ] + then + projName=$2 + + if [ $# -gt "2" ] + then + branch=$3 + fi + fi +else + echo "Usage ./git_commit.sh [] []" + echo " : optional, default=wing.nus" + echo " : optional (must specify projName before using branch), default=master" + exit +fi + +echo "git add ." +git add . + +echo "git commit -a -m \"$comment\"" +git commit -a -m "$comment" + +echo "git pull $projName $branch" +git pull $projName $branch + +echo "git push $projName $branch" +git push $projName $branch diff --git a/bin/BiblioScript/input_sample.txt b/bin/BiblioScript/input_sample.txt new file mode 100644 index 0000000..74dfaf9 --- /dev/null +++ b/bin/BiblioScript/input_sample.txt @@ -0,0 +1,22 @@ +Maretti, E., e Zarri, G.P. (1966). Su un’applicazione dei calcolatori relativa alla collatio codicum: un ausilio moderno per l’edizione critica dei testi. Istituto Lombardo (Rendiconti Lettere) 100: 321-332. +Maretti, E., and Zarri, G.P. (1967). Collatio Codicum: An Exercise in COMIT Programming. La ricerca scientifica 37: 608-611. +Zarri, G.P. (1967). Une expérience pour l’automatisation des recherches en papyrologie. Revue du LASLA (4): 55-85. +Maretti, E., and Zarri, G.P. (1968). A Preliminary Program for Computer Application to Papyrology. La ricerca scientifica 38: 1125-1130. +Maretti, E., and Zarri, G.P. (1968). A Computer Approach to Dom Quentin’s method of recensio. La ricerca scientifica 38: 1333-1337. +Zarri, G.P. (1968). Linguistica algoritmica e meccanizzazione della collatio codicum. Lingua e stile: 3, 21-40. +Baldacci, P., Maretti, E., and Zarri, G.P. (1969). Preliminaries to a New Automated Edition of C.I.L. V. La ricerca scientifica 39: 288-296. +Zarri, G.P. (1969). Il metodo per la recensio di Dom H. Quentin esaminato criticamente mediante la sua traduzione in un algoritmo per elaboratore elettronico. 
Lingua e stile 4: 161-182. +Heyler, A., Leclant, J., Maretti, E., et Zarri, G.P. (1970). Problèmes relatifs à l’enregistrement et au traitement de documents épigraphiques rédigés dans une langue très imparfaitement connue, le méroîtique. Dans: Archéologie et calculateurs, problèmes sémiologiques et mathématiques, Gardin, J.-C., éd. Paris: Editions du CNRS. +Maretti, E., and Zarri, G.P. (1970). Papyrology as an Investigation Field of Algorithmic Linguistics. In: Proceedings of the Twelfth International Congress of Papyrology, Samuel, D.H., ed. Toronto: A.M. Hakkert Ltd. +Maretti, E., e Zarri, G.P. (1971). L’arte dell’edizione critica è da meccanizzare? In: Actele celui de-al XII-lea Congres International di Lingvistica si Filologie Romanica, 2e vol. Bucuresti: Editura Academiei Republicii Socialiste Romania. +Maretti, E., e Zarri, G.P. (1971). Prospettive d’impiego dei calcolatori nelle ricerche linguistiche e di papirologia. Istituto Lombardo (Rendiconti Lettere) 105: 3-20. +Zarri, G.P. (1971). L’automazione delle procedure di critica testuale, problemi e prospettive. Lingua e stile 6: 397-414. +Zarri, G.P. (1972). Su alcuni problemi di metodo nelle tecniche di critica testuale. Pensiero e linguaggio 3: 131-145. +Zarri, G.P. (1973). Algorithms, Stemmata Codicum, and the Theories of Dom H. Quentin. In: The Computer and Literary Studies, Aitken, A.J., et al., eds. Edinburgh: University Press. +Zarri, G.P. (1974). Une étude quentinienne sur la tradition manuscrite de la Copa: Revue du LASLA (1): 1-16. +Zarri, G.P. (1976). A Computer Model for Textual Criticism?” In: The Computer in Literary and Linguistic Studies, Jones, A., and Churchhouse, R.F., ed. Cardiff: University of Wales Press. +Baldacci, P., Cavagnola, B., Ianovitz, O., Maretti, E., Masperi, G., Michelotto, P.G., and Zarri, G.P. (1977). POLEMON I: A Program of Automatic Construction of Indexes for the Fifth Volume of Corpus Inscriptionum Latinarum. 
In: Computational and Mathematical Linguistics - Proceedings of the 5th International Conference on Computational Linguistics, COLING’73, Zampolli, A., and Calzolari, N., eds. Firenze: Leo S. Olschki. +Irigoin, J., et Zarri, G.P., éds. (1979). La pratique des ordinateurs dans la critique des textes. Paris: Editions du CNRS. +Zarri, G.P. (1979). Une méthode de dérivation quentinienne pour la constitution semi-automatique d’une généalogie de manuscrits: premier bilan. Dans: La pratique des ordinateurs dans la critique des textes, Irigoin, J., et Zarri, G.P., éds. Paris: Editions du CNRS. +Borsetta, P.F., and Zarri, G.P. (1981). An Application of the QUENTIN/80 Software to the Study of the Manuscript Tradition of the Appendix Vergiliana. Dans: Actes du Congrès International Informatique et Sciences Humaines. Liège: LASLA.. +Zarri, G.P. (1989). Some Experiments on Automated Textual Criticism. In: Miscellanea di studi in onore di Aurelio Roncaglia. Modena: Mucchi Editore. \ No newline at end of file diff --git a/bin/BiblioScript/parscit2mods.xsl b/bin/BiblioScript/parscit2mods.xsl new file mode 100644 index 0000000..bb32cd8 --- /dev/null +++ b/bin/BiblioScript/parscit2mods.xsl @@ -0,0 +1,209 @@ + + + Matteo Romanello + + + + http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-2.xsd + ### JOURNAL ARTICLES ### + + ### BOOKS ### + + ### BOOK SECTIONS ### + + + + Here is where most of the inetersting stuff happen. + + + + + + + + text + + + + + + + host + + + continuing + + + + marc + journal + + + academic journal + + + + + + + + + + monographic + + + + + + + + host + + + book_editor + + + monographic + + + collection + + + + + citekey + + + + + Auhtors of a journal article + + + journal_article + + + Auhtors of a book + + + book + + + + + book_section + + + Handles the creation of name elements in mods format. The current mode. 
+ + + + + personal + + + + + given + + + family + + + + + + + + marcrelator + text + + + author + + + creator + + + author + + + editor + + + + + + + + + + + + + + + + + + + + + + + page + + + + + + + + + + + + volume + + + + + + + + + + + + + + text + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/bin/BiblioScript/saxonhe9-2-1-2j/doc/img/saxonica_logo.gif b/bin/BiblioScript/saxonhe9-2-1-2j/doc/img/saxonica_logo.gif new file mode 100644 index 0000000..8f0bd8d Binary files /dev/null and b/bin/BiblioScript/saxonhe9-2-1-2j/doc/img/saxonica_logo.gif differ diff --git a/bin/BiblioScript/saxonhe9-2-1-2j/doc/index.html b/bin/BiblioScript/saxonhe9-2-1-2j/doc/index.html new file mode 100644 index 0000000..ce4293a --- /dev/null +++ b/bin/BiblioScript/saxonhe9-2-1-2j/doc/index.html @@ -0,0 +1,56 @@ + + + + + Saxonica: XSLT and XQuery Processing: Welcome + + + + + + + +
+
+
+
+
Saxonica.com
+ +
+

Welcome to Saxon

+ + +

Online Documentation

+ +

Saxon documentation for the current release is available online:

+ + + + +

Downloads

+ +

Saxon documentation, together with source code and sample applications + can also be downloaded, both for the current release and for earlier releases. +

+ +

The same file saxon-resources8-N.zip covers both Saxon products + (Saxon-B and Saxon-SA), and both platforms (Java and .NET).

+ +

The file also contains sample applications and Saxon-B source code.

+ + + + +
+ + \ No newline at end of file diff --git a/bin/BiblioScript/saxonhe9-2-1-2j/doc/saxondocs.css b/bin/BiblioScript/saxonhe9-2-1-2j/doc/saxondocs.css new file mode 100644 index 0000000..681c337 --- /dev/null +++ b/bin/BiblioScript/saxonhe9-2-1-2j/doc/saxondocs.css @@ -0,0 +1,228 @@ + + +/* +Text blue: #3D5B96 +Dark blue: #c1cede +Mid blue: #e4eef0 +Light blue: #f6fffb +mid green #B1CCC7 +rust #96433D +*/ + +/* used for frameset holders */ +.bgnd { + margin-top:0; + margin-left:0; + background: #f6fffb; + } + +/* used for menu */ + +.menu { + background: #f6fffb; + margin-top:20; + margin-left:40; + SCROLLBAR-FACE-COLOR: #c1cede; + SCROLLBAR-HIGHLIGHT-COLOR: #e4eef0; + SCROLLBAR-SHADOW-COLOR: #e4eef0; + SCROLLBAR-ARROW-COLOR: #f6fffb; + SCROLLBAR-BASE-COLOR: #e4eef0; +} + +/* used for content pages */ + +.main { + background: #e4eef0; + margin-top:10px; + margin-left:5px; + margin-right:5px; + margin-bottom:20px; + SCROLLBAR-FACE-COLOR: #c1cede; + SCROLLBAR-HIGHLIGHT-COLOR: #e4eef0; + SCROLLBAR-SHADOW-COLOR: #e4eef0; + SCROLLBAR-ARROW-COLOR: #f6fffb; + SCROLLBAR-BASE-COLOR: #e4eef0; +} + +/* used for menu links */ + +a { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 8pt; + font-style:normal; + color: #3D5B96; + font-weight: normal; + text-decoration: none; +} + +/* used for in body links */ + +a.bodylink { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 9pt; + font-style:normal; + color: #3D5B96; + font-weight: normal; + text-decoration: underline; +} + +/* used for table of contents level 1 */ + +a.toc1 { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 12pt; + font-style:normal; + color: #3D5B96; + font-weight: bold; + text-decoration: none; +} + +/* used for table of contents level 2 */ + +a.toc2 { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 10pt; + font-style:normal; + color: #3D5B96; + font-weight: normal; + text-decoration: none; +} + +/* used for menu heading */ +.title { + 
font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 10pt; + font-style:normal; + color: #3D5B96; + font-weight: bold; + text-decoration: none; + line-height: 1.3em; +} + +/* used for main page headings */ + + +h1 { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 14pt; + font-style: normal; + color: #3D5B96; + font-weight: bold; + text-decoration: none; + } + +/* used for subheads in pref. to H2 etc, to limit underlining width */ + +.subhead { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 10pt; + font-style: normal; + color: #3D5B96; + font-weight: bold; + text-decoration: none; + border-bottom : thin dashed #3D5B96; + padding-right : 5px; +} + +/* used for standard text */ + +p { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 9pt; + font-style: normal; + color: #3D5B96; + font-weight: normal; + text-decoration: none; + line-height: 1.3em; + padding-right:15px; +} + +code { + font-family: lucida sans typewriter, courier, monospace; + font-size: 8pt; + font-style: normal; + font-weight: normal; + text-decoration: none; + line-height: 1.3em; +} + +ul { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 9pt; + font-style: normal; + color: #3D5B96; + font-weight: normal; + text-decoration: none; +} + +li { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 9pt; + font-style: normal; + color: #3D5B96; + font-weight: normal; + +} + +/* used for text in boxed areas */ + +.boxed { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 8pt; + font-style: normal; + color: #96433D; + font-weight: bold; + text-decoration: none; + margin-top:5px; + margin-bottom:5px; +} + +/* used for example code */ + +.codeblock { + background: #B1CCC7; + /*background: #e4eef0;*/ + font-family: lucida sans typewriter, courier, monospace; + font-size: 8pt; + font-style: normal; + color: #96433D; + font-weight: normal; + text-decoration: none; + padding-right:15px; +} + +/* used 
for example commands */ + +.command { + font-size: 8pt; + font-style: normal; + color: #96433D; + font-weight: bold; + text-decoration: none; + padding-right:15px; +} + + + +/* used for links in boxed areas */ + +a.rust { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 8pt; + font-style:normal; + color: #96433D; + font-weight: bold; + text-decoration: underline; +} + +/* used for links at the end of a page */ + +a.nav { + font-family: Verdana, Arial, Helvetica, sans-serif; + font-size: 8pt; + font-style:normal; + color: #96433D; + font-weight: bold; +} + + diff --git a/bin/BiblioScript/saxonhe9-2-1-2j/notices/CERN.txt b/bin/BiblioScript/saxonhe9-2-1-2j/notices/CERN.txt new file mode 100644 index 0000000..a155b84 --- /dev/null +++ b/bin/BiblioScript/saxonhe9-2-1-2j/notices/CERN.txt @@ -0,0 +1,13 @@ + +(This notice is included in the Saxon distribution because Saxon includes a QuickSort +module that was originally developed by Wolfgang Hoschek at CERN, and which was licensed +for use under the conditions specified here.) + + +Copyright © 1999 CERN - European Organization for Nuclear Research. + +Permission to use, copy, modify, distribute and sell this software and its documentation for any purpose +is hereby granted without fee, provided that the above copyright notice appear in all copies and +that both that copyright notice and this permission notice appear in supporting documentation. +CERN makes no representations about the suitability of this software for any purpose. +It is provided "as is" without expressed or implied warranty. 
\ No newline at end of file diff --git a/bin/BiblioScript/saxonhe9-2-1-2j/notices/JAMESCLARK.txt b/bin/BiblioScript/saxonhe9-2-1-2j/notices/JAMESCLARK.txt new file mode 100644 index 0000000..1eb55f7 --- /dev/null +++ b/bin/BiblioScript/saxonhe9-2-1-2j/notices/JAMESCLARK.txt @@ -0,0 +1,31 @@ + +(This notice is included in the Saxon distribution because Saxon's XPath parser +was originally derived from an XPath parser written by James Clark and made available +under this license. The Saxon XPath parser has since diverged very substantially, but +there are traces of the original code still present.) + +Copyright (c) 1998, 1999 James Clark + +Permission is hereby granted, free of charge, to any person obtaining +a copy of this software and associated documentation files (the +"Software"), to deal in the Software without restriction, including +without limitation the rights to use, copy, modify, merge, publish, +distribute, sublicense, and/or sell copies of the Software, and to +permit persons to whom the Software is furnished to do so, subject to +the following conditions: + +The above copyright notice and this permission notice shall be included +in all copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED ``AS IS'', WITHOUT WARRANTY OF ANY KIND, EXPRESS +OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. +IN NO EVENT SHALL JAMES CLARK BE LIABLE FOR ANY CLAIM, DAMAGES OR +OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, +ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR +OTHER DEALINGS IN THE SOFTWARE. + +Except as contained in this notice, the name of James Clark shall +not be used in advertising or otherwise to promote the sale, use or +other dealings in this Software without prior written authorization +from James Clark. 
diff --git a/bin/BiblioScript/saxonhe9-2-1-2j/notices/LEGAL.txt b/bin/BiblioScript/saxonhe9-2-1-2j/notices/LEGAL.txt new file mode 100644 index 0000000..702948d --- /dev/null +++ b/bin/BiblioScript/saxonhe9-2-1-2j/notices/LEGAL.txt @@ -0,0 +1,33 @@ +LEGAL NOTICE + +This notice is issued to fulfil the requirements of the Mozilla Public License version 1.0 ("MPL 1.0") +sections 3.4(a) and 3.6. MPL 1.0 can be found at http://www.mozilla.org/MPL/MPL-1.0.html. + +Section 3.4(a) of MPL 1.0 states that any third party intellectual property rights in particular +functionality or code must be notified in a text file named LEGAL that is issued with the source code. Saxon +includes a number of such third party components, and the relevant claims are included in notices included +in the same directory as this notice. Although MPL 1.0 requires this notice to be included only with source +code, some of the third parties may also require notices to be included with executable code. Therefore, Saxon +executable code must not be distributed separately from this notice and all the accompanying third +party notices. The term "Distribution" here includes making the code available for download, and its +inclusion in download repositories such as Maven. + +Section 3.6 of MPL 1.0 states: + +You may distribute Covered Code in Executable form only if the requirements of Section 3.1-3.5 have +been met for that Covered Code, and if You include a notice stating that the Source Code version of +the Covered Code is available under the terms of this License, including a description of how and +where You have fulfilled the obligations of Section 3.2. + +Section 3.2 requires the Source Code of Covered Code to be made available via an accepted Electronic +Distribution Mechanism. 
+ +The Source Code version of the Covered Code (that is, the source code of Saxon-HE) is available under the +terms of the Mozilla Public License version 1.0, and may be obtained from the Subversion repository +for the Saxon project on SourceForge, at https://sourceforge.net/svn/?group_id=29872. +The precise version of the Subversion source for a particular Saxon maintenance release can be +determined by referring to the release notes for the particular release in the SourceForge download area. + +Note that MPL 1.0 requires that any modifications to this source code must be made available under the terms +of the MPL "to anyone to whom you made an executable version available". As a courtesy, it is also requested +that you make such modifications available to Saxonica Limited. \ No newline at end of file diff --git a/bin/BiblioScript/saxonhe9-2-1-2j/notices/LICENSE.txt b/bin/BiblioScript/saxonhe9-2-1-2j/notices/LICENSE.txt new file mode 100644 index 0000000..c57dd4c --- /dev/null +++ b/bin/BiblioScript/saxonhe9-2-1-2j/notices/LICENSE.txt @@ -0,0 +1,15 @@ +The contents of these file are subject to the Mozilla Public License Version 1.0 (the "License"); +you may not use these files except in compliance with the License. You may obtain a copy of the +License at http://www.mozilla.org/MPL/ + +Software distributed under the License is distributed on an "AS IS" basis, +WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for the +specific language governing rights and limitations under the License. + +The Original Code is all Saxon modules labelled with a notice referring to this license. + +The Initial Developer of the Original Code is Michael Kay, except where otherwise specified in an individual module. + +Portions created by other named contributors are copyright as identified in the relevant module. All Rights Reserved. + +Contributor(s) are listed in the documentation: see notices/contributors. 
\ No newline at end of file diff --git a/bin/BiblioScript/saxonhe9-2-1-2j/notices/THAI.txt b/bin/BiblioScript/saxonhe9-2-1-2j/notices/THAI.txt new file mode 100644 index 0000000..827ffec --- /dev/null +++ b/bin/BiblioScript/saxonhe9-2-1-2j/notices/THAI.txt @@ -0,0 +1,39 @@ + +(This notice is included in the Saxon distribution because Saxon +uses code for conversion of XML Schema Regular expressions to +Java/.NET regular expressions that was originally written by James +Clark and made available under this license. The Saxon version of +the code has been enhanced in various ways but is still recognizably +based on the original.) + +Copyright (c) 2001-2003 Thai Open Source Software Center Ltd +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are +met: + + Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in + the documentation and/or other materials provided with the + distribution. + + Neither the name of the Thai Open Source Software Center Ltd nor + the names of its contributors may be used to endorse or promote + products derived from this software without specific prior written + permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE REGENTS OR +CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR +PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF +LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING +NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. diff --git a/bin/BiblioScript/saxonhe9-2-1-2j/notices/UNICODE.txt b/bin/BiblioScript/saxonhe9-2-1-2j/notices/UNICODE.txt new file mode 100644 index 0000000..cef1606 --- /dev/null +++ b/bin/BiblioScript/saxonhe9-2-1-2j/notices/UNICODE.txt @@ -0,0 +1,38 @@ + +(This notice is included in the Saxon distribution because Saxon +uses code performing Unicode Normalization that was originally written by Mark +Davis and made available under this license. The Saxon version of the +code has been enhanced in various minor ways but is still recognizably +based on the original. For details of modifications, see the comments in +the source code.) + + +COPYRIGHT AND PERMISSION NOTICE +Copyright © 1991-2007 Unicode, Inc. All rights reserved. Distributed under the Terms of Use +in http://www.unicode.org/copyright.html. 
+ +Permission is hereby granted, free of charge, to any person obtaining a copy of the Unicode +data files and any associated documentation (the "Data Files") or Unicode software and any +associated documentation (the "Software") to deal in the Data Files or Software without +restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, +and/or sell copies of the Data Files or Software, and to permit persons to whom the Data Files or +Software are furnished to do so, provided that (a) the above copyright notice(s) and this +permission notice appear with all copies of the Data Files or Software, (b) both the above +copyright notice(s) and this permission notice appear in associated documentation, and +(c) there is clear notice in each modified Data File or in the Software as well as in the +documentation associated with the Data File(s) or Software that the data or software has +been modified. + +THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. +IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE +BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, +OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, +WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, +ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE DATA +FILES OR SOFTWARE. + +Except as contained in this notice, the name of a copyright holder shall not be used +in advertising or otherwise to promote the sale, use or other dealings in these +Data Files or Software without prior written authorization of the copyright holder. 
diff --git a/bin/BiblioScript/saxonhe9-2-1-2j/saxon9he.jar b/bin/BiblioScript/saxonhe9-2-1-2j/saxon9he.jar new file mode 100644 index 0000000..f86f658 Binary files /dev/null and b/bin/BiblioScript/saxonhe9-2-1-2j/saxon9he.jar differ diff --git a/bin/BiblioScript/tmpDir/parscit.bib b/bin/BiblioScript/tmpDir/parscit.bib new file mode 100644 index 0000000..692a15b --- /dev/null +++ b/bin/BiblioScript/tmpDir/parscit.bib @@ -0,0 +1,68 @@ +@Article{d1e50, +author="Deerwester, S. +and Furnas, G. +and Landauer, T. +and Harshman, R.", +title="Indexing by Latent Semantic Anaysis", +journal="Journal of the American Society of Information Science", +pages="41--6" +} + +@Article{d1e260, +journal="Journal of Computer Science and Information Management" +} + +@Article{d1e288, +author="Wendlandt, E. +and Driscoll, R.", +title="Incorporating a semantic analysis into a document retrieval strategy", +journal="CACM", +pages="54--48" +} + +@Book{d1e87, +author="Halliday, M. A. K.", +title="An Introduction to Functional Grammar. Edward", +year="1985", +address="Arnold, London" +} + +@Book{d1e121, +author="Jang, S.", +title="Extracting Context from Unstructured Text Documents by Content Word Density", +year="1997" +} + +@Book{d1e202, +author="Shin, H.", +title="Incorporating Semantic Categories (SEMCATs) into a Partial Information Retrieval System", +year="1997" +} + +@Book{d1e236, +author="Shin, H. +and Stach, J.", +title="Incorporating Probabilistic Semantic Categories (SEMCATs) Into Vector Space Techniques for Partial Document Retrieval", +year="1999" +} + +@InCollection{d1e7, +author="Boyd, R. +and Driscoll, J. +and Syu, I.", +title="incorporating Semantics Within a Connectionist Model and a Vector Processing Model", +booktitle="In Proceedings of the TREC-2", +year="1994", +pages="NIST." +} + +@InCollection{d1e167, +author="Moffat, A. +and Davis, R. +and Wilkinson, R. 
+and Zobel, J.", +title="Retrieval of Partial Documents", +booktitle="In Proceedings of TREC-2", +year="1994" +} + diff --git a/bin/BiblioScript/tmpDir/parscit_mods.xml b/bin/BiblioScript/tmpDir/parscit_mods.xml new file mode 100644 index 0000000..cfa0188 --- /dev/null +++ b/bin/BiblioScript/tmpDir/parscit_mods.xml @@ -0,0 +1,281 @@ + + + + + Indexing by Latent Semantic Anaysis + + text + + S + Deerwester + + author + + + + G + Furnas + + author + + + + T + Landauer + + author + + + + R + Harshman + + author + + + + + Journal of the American Society of Information Science + + + continuing + + + + 41 + 6 + + + journal + academic journal + + d1e50 + + + text + + + Journal of Computer Science and Information Management + + + continuing + + journal + academic journal + + d1e260 + + + + Incorporating a semantic analysis into a document retrieval strategy + + text + + E + Wendlandt + + author + + + + R + Driscoll + + author + + + + + CACM + + + continuing + + + + 54 + 48 + + + journal + academic journal + + d1e288 + + + + An Introduction to Functional Grammar. Edward + + text + + M + A + K + Halliday + + creator + + + + 1985 + + Arnold, London + + monographic + + d1e87 + + + + Extracting Context from Unstructured Text Documents by Content Word Density + + text + + S + Jang + + creator + + + + 1997 + monographic + + d1e121 + + + + Incorporating Semantic Categories (SEMCATs) into a Partial Information Retrieval System + + text + + H + Shin + + creator + + + + 1997 + monographic + + d1e202 + + + + Incorporating Probabilistic Semantic Categories (SEMCATs) Into Vector Space Techniques for Partial Document Retrieval + + text + + H + Shin + + creator + + + + J + Stach + + creator + + + + 1999 + monographic + + d1e236 + + + + incorporating Semantics Within a Connectionist Model and a Vector Processing Model + + text + + R + Boyd + + author + + + + J + Driscoll + + author + + + + I + Syu + + author + + + + + In Proceedings of the TREC-2 + + + + NIST. 
+ + + + + monographic + 1994 + + collection + + d1e7 + + + + Retrieval of Partial Documents + + text + + A + Moffat + + author + + + + R + Davis + + author + + + + R + Wilkinson + + author + + + + J + Zobel + + author + + + + + In Proceedings of TREC-2 + + + monographic + 1994 + + collection + + d1e167 + + + diff --git a/bin/W00-0102.out b/bin/BiblioScript/tmpDir/parscit_temp.xml similarity index 85% rename from bin/W00-0102.out rename to bin/BiblioScript/tmpDir/parscit_temp.xml index 65bf71b..5f988f3 100644 --- a/bin/W00-0102.out +++ b/bin/BiblioScript/tmpDir/parscit_temp.xml @@ -1,19 +1,5 @@ - - -Using Long Runs as Predictors of Semantic Coherence in a Partial Document Retrieval System -Hyopil Shin -Computing Research Laboratory, NMSU -
PO Box 30001 Las Cruces, NM, 88003
-hshin@crl.nmsu.edu -Jerrold F Stach -Computer Science Telecommunications, UMKC -
5100 Rockhill Road Kansas City, MO, 64110
-stach@cstp.umkc.edu -We propose a method for dealing with semantic complexities occurring in information retrieval systems on the basis of linguistic observations. Our method follows from an analysis indicating that long runs of content words appear in a stopped document cluster, and our observation that these long runs predominately originate from the prepositional phrase and subject complement positions and as such, may be useful predictors of semantic coherence. From this linguistic basis, we test three statistical hypotheses over a small collection of documents from different genre. By coordinating thesaurus semantic categories (SEMCATs) of the long run words to the semantic categories of paragraphs, we conclude that for paragraphs containing both long runs and short runs, the SEMCAT weight of long runs of content words is a strong predictor of the semantic coherence of the paragraph -
-
diff --git a/bin/BiblioScript/xslt_doc/docHtml.css b/bin/BiblioScript/xslt_doc/docHtml.css new file mode 100644 index 0000000..97a952a --- /dev/null +++ b/bin/BiblioScript/xslt_doc/docHtml.css @@ -0,0 +1,640 @@ +/*---------------------------------------- + Global +-----------------------------------------*/ + +body{ +} +body, table { + font-family:arial, helvetica, sans-serif; + font-size:12px; +} + +@media print{ + body, table { + font-size:10px; + } +} + +/*-------------------------------------------- + Source code in the instance, source or + annotations. +--------------------------------------------*/ +span.tEl { + color: #000096; + background-color:inherit; +} +span.tXSLEl { + color: #0064C8; + background-color:inherit; +} +span.tAN { + color: #F5844C; + background-color:inherit; +} +span.tAV { + color: #993300; + background-color:inherit; +} +span.tI { + color: #000000; + background-color:inherit; +} +span.tT { + color: #000000; + background-color:inherit; +} +span.tC { + color: #006400; + background-color:inherit; +} +span.tCD { + color: #008C00; + background-color:inherit; +} +span.tPI { + color: #8B26C9; + background-color:inherit; +} +span.tEn { + color: #969600; + background-color:inherit; +} +/* Title sections */ +span.qname{ + color:black; + background-color:inherit; +} + +span.titleTemplateName{ + background-image:url('img/tplN16.gif'); + background-position:left 0; + background-repeat:no-repeat; + padding-left:20px; + color:black; +} + +span.titleTemplateMatch { + color:black; + background-color:inherit; + margin-left:5px; + margin-right:10px; +} + +span.titleTemplateMode:before { + content:'['; + /*margin-left:10px;*/ +} +span.titleTemplateMode:after { + content:']'; +} +span.titleTemplateMode { + color:gray; + background-color:inherit; + font-style:italic; +} + +/* Template reference sections*/ +span.nRf{ + background-image:url('img/tplN12.gif'); + background-repeat:no-repeat; + padding-left:14px; + margin-right:4px; +} + +span.mRfI{ + 
padding-left:14px; + margin-right:4px; +} + +span.mRf{ + margin-right:4px; +} + +span.cRfI{ + padding-left:14px; +} + +span.mdRf:before { + content:'['; + margin-left:4px; +} +span.mdRf:after{ + content:']'; +} +span.mdRf{ + color:gray; + font-style:italic; +} + +/* Indent the wrapping lines to the right. + Only the first line is at 0, the rest are + some distance to the right.*/ +.rt_content div[id]{ + /*margin-left:30px;*/ +} + +/*----------------------------------------- + Documentation sections. +------------------------------------------*/ + +div.cmpT { + font-size:1.4em; + font-weight:bold; + text-align:left; + margin-top:1.4em; + margin-bottom:0.7em; +} +div.cmpT{ +/* color:rgb(255, 160, 100);*/ + color:#333333; + background-color:inherit; +} + + +/* Tables. */ + +td, th { + padding:2px 2px 2px 5px; + text-align:left; + vertical-align:top; +} + +tr > th { + background-color:rgb(206, 239, 174); + color:inherit; +} + +/* Contrast for the titles*/ +table.component { + width:100%; + border-spacing:1px; +} + +@media print{ + table.component{ + border:1px solid gray; + border-collapse:collapse; + } + + table.component td{ + border:1px solid gray; + } +} + + + +table.component td.fCol{ +/* pink */ + /*background-color:#FFC0C0;*/ +/*green */ + background-color:rgb(210, 240, 180); +/*bleu*/ + /*background-color:#89C6E2;*/ +/*orange*/ + /*background-color:#FFD697;*/ +/*brown*/ + /*background-color:#D5BC8E;*/ +/*lilla*/ + /*background-color:#DDDDFF;*/ +/*gray-bleu*/ + /*background-color:#CAD0DD;*/ +/*brown-light*/ + /*background-color:#DECFB8;*/ +/*gray-green*/ + /*background-color:#C6D0CD;*/ +/*bleu-2*/ + /*background-color:#B5D5FF;*/ +/*gray*/ + /*background-color:#CCCCCC;*/ + + +/*bleu +background-color:#C4DAF4; +*/ + + + + color:black; + width:12%; +} + +table.component table td.fCol{ + border:none; + background-color:rgb(225, 245, 206); + color: inherit; +} + +td.fCol b{ + font-weight:normal; +} + + +/* The Name and Expand/Collapse control are on the same line + 
but at different ends.*/ +td.fCol div.flL{ + float:left; +} +td.fCol div.flR{ + float:right; +} + +/* Subtables */ +table.component table{ + width:100%; +} +table.component table, +table.component table td, +table.component table th{ + border:0; +} + + +/* Properties table */ +table.propertiesTable { + border-spacing:1px; +} +table.propertiesTable td.fCol{ + width:140px; + text-transform:capitalize; +} +/* Used by table */ +table.uBT { + border-spacing:1px; +} +table.uBT td.fCol{ + width:140px; + text-transform:capitalize; +} + +/* Facets table*/ +table.facetsTable { + border-spacing:1px; +} +table.facetsTable td.fCol{ + width:140px; + text-transform:capitalize; +} + +/* Attributes table */ +table.attsT { + border-spacing:1px; +} +table.attsT th{ + font-weight:normal; +} +table.attsT tr:hover{ + color:inherit; + background-color:rgb(225,245,206); +} + + +/* Identity constraints table */ +table.identityConstraintsTable { + border-spacing:1px; +} +table.identityConstraintsTable th{ + font-weight:normal; +} +table.identityConstraintsTable tr:hover{ + color:inherit; + background-color:rgb(225,245,206); +} + + + +/*--------------------------------------- + The diagram. +----------------------------------------*/ + +table.component td.diagram { + background-color:white; + color:inherit; +} + + +/* This table is a workaround for an IE bug regarding pre-wrap */ +table.pWCont, +table.pWCont td{ + border:0; + margin:0; + padding:0; +} + + +/* Annotations. */ +div.annotation{ +} +div.annotation pre{ + font-family:arial, helvetica, sans-serif; + margin:0; +} +div.annotation, +div.annotation table, +div.annotation table td{ + margin:0; + padding:0; +} + +/* Hierarchy */ +ul > li{ + list-style:none; +} + +ul { + margin:2px; + padding:0; +} + +ul ul li { + padding-left:10px; + + list-style-image:url('img/hierarchy_arrow.gif'); + list-style-position:inside; +} + +/*------------------------------------- + Rounded tables. 
+---------------------------------------*/ + +table.rt, +table.rt_with_bg{ + border-collapse:collapse; + border-spacing:0; + width:100%; +} +table.rt_with_bg{ + /*background-color:#C0F0A0;*/ + background-color:white; + color:inherit; +} + + +.rt_cTL{ + background-color:transparent; + background-repeat:no-repeat; + background-position:right; + width:8px; + height:8px; + margin:0; + padding:0; +} +.rt_cTL{ + background-image:url('img/cTL.gif'); +} + + +.rt_cBL{ + background-color:transparent; + background-repeat:no-repeat; + background-position:right; + width:8px; + height:8px; + margin:0; + padding:0; +} +.rt_cBL{ + background-image:url('img/cBL.gif'); +} + + +.rt_cTR{ + background-color:transparent; + background-repeat:no-repeat; + width:8px; + height:8px; + margin:0; + padding:0; + +} +.rt_cTR{ + background-image:url('img/cTR.gif'); +} + + +.rt_cBR{ + background-color:transparent; + background-repeat:no-repeat; + width:8px; + height:8px; + margin:0; + padding:0; + +} +.rt_cBR{ + background-image:url('img/cBR.gif'); +} + + +.rt_cnt{ + background-color:white; + color:inherit; + width:auto; + margin:0; + padding:0; +} + + +.rt_lL{ + background-color:transparent; + background-repeat:repeat-y; + background-position:right; + width:8px; + margin:0; + padding:0; + +} +.rt_lL{ + background-image:url('img/lL.gif'); +} + + +.rt_lR{ + background-repeat:repeat-y; + width:8px; + margin:0; + padding:0; +} +.rt_lR{ + background-image:url('img/lR.gif'); +} + + +.rt_lT{ + background-color:transparent; + background-repeat:repeat-x; + height:8px; + width:auto; + margin:0; + padding:0; +} +.rt_lT{ + background-image:url('img/lT.gif'); +} + +.rt_lB{ + background-color:transparent; + background-repeat:repeat-x; + height:8px; + width:auto; + margin:0; + padding:0; +} +.rt_lB{ + background-image:url('img/lB.gif'); +} + + +/* -------------------------------------- + Controls for bulk showing/hidding sections + from the documentation. 
+----------------------------------------*/ + +.globalControls h3{ + margin:0.1em; + font-size:1.2em; +} + +.globalControls table td{ + padding:0; + margin:0; +} + +.globalControls{ + position:fixed; + right:0; + background-color:transparent; + padding-left:0.5em; + padding-right:0.5em; + padding-bottom:0.5em; + width:190px; +} + +@media print{ + .globalControls{ + display:none; + } +} + +/* Expand/collapse of a single section. */ +input.control { + text-align:center; + vertical-align:middle; + padding:0; + padding-right:3px; + padding-bottom:2px; + +} + + +/* close button */ +td.rt_content div span input{ + font-size:0.8em; +} + +@media print{ + input.control{ + display:none; + } +} + +/*----------------------------------------- + Navigation. +------------------------------------------*/ +a, a:visited { + color:rgb(0, 0, 150); + background-color:inherit; + font-weight: bold; +} +a:link, a:visited { + text-decoration:none; +} +a:hover { + text-decoration:underline; +} +a.iRf { + display: block; +} + +div.toTop{ + text-align:right; +} +div.toTop a{ + font-weight:normal; +} + +/*------------------------------------------ + The second level of index. Floating DIVs +-------------------------------------------*/ +.toc { +} +.toc div.verticalLayout, div.horizontalLayout{ + float:left; + display:block; + + background-color:white; + color:inherit; + + min-width:130px; + min-height:50px; + + padding:0.5em; +} +/* This is not used. */ +.toc div.verticalLayout { + clear:left; +} + +/* Hack for the IE - acts like a minimum height.*/ +* html .toc div.horizontalLayout, +* html .toc div.verticalLayout { + width:120px; + height:60px; +} + +.toc div.componentGroupTitle{ + font-weight:bold; + margin-bottom:0.5em; + color:black; + background-color:inherit; +} + +/* Namespacces or system ids in the TOC. */ +.toc .indexGroupTitle { + font-weight:bold; + margin-bottom:0.5em; +} + +/*---------------- + The footer. 
+-----------------*/ +.footer{ + margin-top:3em; +} +.redX{ + color:red; + background-color:inherit; + font-size:1.2em; +} +.oXygenLogo{ + color:#1166DD; + background-color:inherit; + font-weight:bold; + font-size:1.2em; +} + + +/* List item from documentation format */ +ul > li.doc{ + list-style:disc; + margin-left:10px; +} + +/* Wrap the long lines in the 'pre' section. */ +pre { + white-space: pre-wrap; /* css-3 */ + white-space: -moz-pre-wrap; /* Mozilla, since 1999 */ + white-space: -pre-wrap; /* Opera 4-6 */ + white-space: -o-pre-wrap; /* Opera 7 */ + word-wrap: break-word; /* Internet Explorer 5.5+ */ + _white-space: pre; /* IE only hack to re-specify in addition to word-wrap */ +} \ No newline at end of file diff --git a/bin/BiblioScript/xslt_doc/img/btM.gif b/bin/BiblioScript/xslt_doc/img/btM.gif new file mode 100644 index 0000000..78d309a Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/btM.gif differ diff --git a/bin/BiblioScript/xslt_doc/img/btP.gif b/bin/BiblioScript/xslt_doc/img/btP.gif new file mode 100644 index 0000000..63e2535 Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/btP.gif differ diff --git a/bin/BiblioScript/xslt_doc/img/cBL.gif b/bin/BiblioScript/xslt_doc/img/cBL.gif new file mode 100644 index 0000000..aacb1da Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/cBL.gif differ diff --git a/bin/BiblioScript/xslt_doc/img/cBR.gif b/bin/BiblioScript/xslt_doc/img/cBR.gif new file mode 100644 index 0000000..48879ca Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/cBR.gif differ diff --git a/bin/BiblioScript/xslt_doc/img/cTL.gif b/bin/BiblioScript/xslt_doc/img/cTL.gif new file mode 100644 index 0000000..b52ae54 Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/cTL.gif differ diff --git a/bin/BiblioScript/xslt_doc/img/cTR.gif b/bin/BiblioScript/xslt_doc/img/cTR.gif new file mode 100644 index 0000000..136df09 Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/cTR.gif differ diff --git 
a/bin/BiblioScript/xslt_doc/img/hierarchy_arrow.gif b/bin/BiblioScript/xslt_doc/img/hierarchy_arrow.gif new file mode 100644 index 0000000..739bb65 Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/hierarchy_arrow.gif differ diff --git a/bin/BiblioScript/xslt_doc/img/lB.gif b/bin/BiblioScript/xslt_doc/img/lB.gif new file mode 100644 index 0000000..c0b44c6 Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/lB.gif differ diff --git a/bin/BiblioScript/xslt_doc/img/lL.gif b/bin/BiblioScript/xslt_doc/img/lL.gif new file mode 100644 index 0000000..bfbef22 Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/lL.gif differ diff --git a/bin/BiblioScript/xslt_doc/img/lR.gif b/bin/BiblioScript/xslt_doc/img/lR.gif new file mode 100644 index 0000000..cd75fdc Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/lR.gif differ diff --git a/bin/BiblioScript/xslt_doc/img/lT.gif b/bin/BiblioScript/xslt_doc/img/lT.gif new file mode 100644 index 0000000..c67c576 Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/lT.gif differ diff --git a/bin/BiblioScript/xslt_doc/img/tplN12.gif b/bin/BiblioScript/xslt_doc/img/tplN12.gif new file mode 100644 index 0000000..6a99711 Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/tplN12.gif differ diff --git a/bin/BiblioScript/xslt_doc/img/tplN16.gif b/bin/BiblioScript/xslt_doc/img/tplN16.gif new file mode 100644 index 0000000..e29b2b8 Binary files /dev/null and b/bin/BiblioScript/xslt_doc/img/tplN16.gif differ diff --git a/bin/BiblioScript/xslt_doc/parscit2mods.html b/bin/BiblioScript/xslt_doc/parscit2mods.html new file mode 100644 index 0000000..30cf9f8 --- /dev/null +++ b/bin/BiblioScript/xslt_doc/parscit2mods.html @@ -0,0 +1,2204 @@ + + + + + Stylesheet documentation for: parscit2mods.xsl + + +
+ + + + + + + + + + + + + + + + +
+

Showing:

+ + + + + + + + + + + + + + + + +
Documentation +
Parameters
Used by
References
Source
+
+
+
+

Table of Contents

+
+
+
Group by:
+
+ + + + +
Main stylesheet + parscit2mods.xsl
+ + + + + + + + + + + + + + + + +
+ + + + + + + +
Stylesheet version2.0
+
+
Template + citationList
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
MatchcitationList
Mode#default
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="citationList">
+  <xsl:element name="modsCollection" namespace="http://www.loc.gov/mods/v3">
+    <xsl:attribute name="xsi:schemaLocation" namespace="http://www.w3.org/2001/XMLSchema-instance">http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-2.xsd</xsl:attribute>
+    <xsl:comment>### JOURNAL ARTICLES ###</xsl:comment>
+    <xsl:apply-templates select="citation[journal]"/>
+    <xsl:comment>### BOOKS ###</xsl:comment>
+    <xsl:apply-templates select="citation[title and not(booktitle) and not(pages) and not(journal)]"/>
+  </xsl:element>
+</xsl:template>
+
+
+
+
Template + citation
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
Documentation
+
+
+
<xd:doc> Here is where most of the interesting stuff happens. </xd:doc>
+
NamespaceNo namespace
Matchcitation
Mode#default
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="citation">
+  <xsl:element name="mods" namespace="http://www.loc.gov/mods/v3">
+    <xsl:attribute name="ID">
+      <!-- add compatibility check for saxon or not -->
+      <xsl:value-of select="generate-id()"/>
+    </xsl:attribute>
+    <xsl:apply-templates select="title"/>
+    <!-- heuristic to determine the kind of resource -->
+    <xsl:choose>
+      <!-- CASE 1: Paper in a Journal-->
+      <xsl:when test="./journal">
+        <xsl:apply-templates select="authors" mode="journal_article"/>
+        <xsl:element name="relatedItem" namespace="http://www.loc.gov/mods/v3">
+          <xsl:attribute name="type">host</xsl:attribute>
+          <xsl:apply-templates select="journal" mode="journal_article"/>
+          <xsl:element name="originInfo" namespace="http://www.loc.gov/mods/v3">
+            <xsl:element name="issuance" namespace="http://www.loc.gov/mods/v3">continuing</xsl:element>
+          </xsl:element>
+          <xsl:element name="part" namespace="http://www.loc.gov/mods/v3">
+            <xsl:apply-templates select="*[name()!='authors'][name()!='journal']" mode="journal_article"/>
+          </xsl:element>
+          <xsl:element name="genre" namespace="http://www.loc.gov/mods/v3">
+            <xsl:attribute name="authority">marc</xsl:attribute>
+            <xsl:text>journal</xsl:text>
+          </xsl:element>
+          <xsl:element name="genre" namespace="http://www.loc.gov/mods/v3">
+            <xsl:text>academic journal</xsl:text>
+          </xsl:element>
+        </xsl:element>
+      </xsl:when>
+      <!-- CASE 2: Book -->
+      <xsl:when test=".[title and not(booktitle) and not(pages) and not(journal)]">
+        <xsl:apply-templates select="authors" mode="book"/>
+        <xsl:element name="originInfo" namespace="http://www.loc.gov/mods/v3">
+          <xsl:apply-templates select="location | date | publisher" mode="book"/>
+          <xsl:element name="issuance" namespace="http://www.loc.gov/mods/v3">monographic</xsl:element>
+        </xsl:element>
+      </xsl:when>
+      <xsl:otherwise>
+        <xsl:apply-templates select="*[name()!='authors']"/>
+      </xsl:otherwise>
+    </xsl:choose>
+    <!--<xsl:element name="typeOfResource" namespace="http://www.loc.gov/mods/v3">text</xsl:element>-->
+    <xsl:element name="identifier" namespace="http://www.loc.gov/mods/v3">
+      <xsl:attribute name="type">citekey</xsl:attribute>
+      <xsl:value-of select="generate-id()"/>
+    </xsl:element>
+  </xsl:element>
+</xsl:template>
+
+
+
+
Template + authorsjournal_article
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
Documentation
+
+
+
<xd:doc>Authors of a journal article</xd:doc>
+
NamespaceNo namespace
Matchauthors
Modejournal_article
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="authors" mode="journal_article">
+  <xsl:apply-templates select="author">
+    <xsl:with-param name="mode">journal_article</xsl:with-param>
+  </xsl:apply-templates>
+</xsl:template>
+
+
+
+
Template + authorsbook
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
Documentation
+
+
+
<xd:doc>Authors of a book</xd:doc>
+
NamespaceNo namespace
Matchauthors
Modebook
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="authors" mode="book">
+  <xsl:apply-templates select="author">
+    <xsl:with-param name="mode">book</xsl:with-param>
+  </xsl:apply-templates>
+</xsl:template>
+
+
+
+
Template + author
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
Documentation
+
+
+
<xd:doc>Handles the creation of name elements in mods format. <xd:param type="string">The current mode.</xd:param>
+    </xd:doc>
+
NamespaceNo namespace
Matchauthor
Mode#default
+
Parameters
+
+
+
+ + + + + + + + + + + +
QNameNamespace
modeNo namespace
+
+
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="author">
+  <xsl:param name="mode"/>
+  <xsl:element name="name" namespace="http://www.loc.gov/mods/v3">
+    <xsl:attribute name="type">personal</xsl:attribute>
+    <xsl:for-each select="tokenize(.,' ')">
+      <xsl:element name="namePart" namespace="http://www.loc.gov/mods/v3">
+        <xsl:choose>
+          <xsl:when test="string-length(.)=1">
+            <xsl:attribute name="type">given</xsl:attribute>
+          </xsl:when>
+          <xsl:otherwise>
+            <xsl:attribute name="type">family</xsl:attribute>
+          </xsl:otherwise>
+        </xsl:choose>
+        <xsl:value-of select="."/>
+      </xsl:element>
+    </xsl:for-each>
+    <xsl:element name="role" namespace="http://www.loc.gov/mods/v3">
+      <xsl:element name="roleTerm" namespace="http://www.loc.gov/mods/v3">
+        <xsl:attribute name="authority">marcrelator</xsl:attribute>
+        <xsl:attribute name="type">text</xsl:attribute>
+        <xsl:choose>
+          <xsl:when test="$mode='journal_article'">
+            <xsl:text>author</xsl:text>
+          </xsl:when>
+          <xsl:when test="$mode='book'">
+            <xsl:text>creator</xsl:text>
+          </xsl:when>
+        </xsl:choose>
+      </xsl:element>
+    </xsl:element>
+  </xsl:element>
+</xsl:template>
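The `author` template above splits the author string on single spaces and labels each token by length alone: a one-character token becomes a `given` name part, anything else a `family` name part. The Python sketch below mirrors only that labelling rule so it can be tried on sample names; `name_parts` is a hypothetical helper, not part of BiblioScript, and it does not build any MODS elements.

```python
def name_parts(author):
    """Label each space-separated token the way the author template does.

    Hypothetical helper for illustration: a token of length 1 is tagged
    'given', every other token (including initials with a trailing period)
    is tagged 'family'.
    """
    parts = []
    for token in author.split(" "):
        kind = "given" if len(token) == 1 else "family"
        parts.append((kind, token))
    return parts


if __name__ == "__main__":
    for sample in ("J Smith", "J. Smith", "Matteo Romanello"):
        print(sample, "->", name_parts(sample))
```

Under this rule an initial written with a period, such as `J.`, has length 2 and is therefore tagged `family`, and a spelled-out first name is tagged `family` as well. Whether that matters depends on how the input names are formatted, which this diff does not show.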
+
+
+
+
Template + title
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
+
Used by
+
+
+
+ + + + + +
Templatestitle; journaljournal_article
+
+
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template name="title">
+  <xsl:element name="titleInfo" namespace="http://www.loc.gov/mods/v3">
+    <xsl:element name="title" namespace="http://www.loc.gov/mods/v3">
+      <xsl:value-of select="."/>
+    </xsl:element>
+  </xsl:element>
+</xsl:template>
+
+
+
+
Template + title
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
Matchtitle
Mode#default
+
References
+
+
+
+ + + + + +
Templatetitle
+
+
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="title">
+  <xsl:call-template name="title"/>
+</xsl:template>
+
+
+
+
Template + journaljournal_article
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
Matchjournal
Modejournal_article
+
References
+
+
+
+ + + + + +
Templatetitle
+
+
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="journal" mode="journal_article">
+  <xsl:call-template name="title"/>
+</xsl:template>
+
+
+
+
Template + pagesjournal_article
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
Matchpages
Modejournal_article
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="pages" mode="journal_article">
+  <xsl:element name="extent" namespace="http://www.loc.gov/mods/v3">
+    <xsl:attribute name="unit">page</xsl:attribute>
+    <xsl:element name="start" namespace="http://www.loc.gov/mods/v3">
+      <xsl:value-of select="tokenize(.,'--')[1]"/>
+    </xsl:element>
+    <xsl:element name="end" namespace="http://www.loc.gov/mods/v3">
+      <xsl:value-of select="tokenize(.,'--')[2]"/>
+    </xsl:element>
+  </xsl:element>
+</xsl:template>
+
+
+
+
Template + volumejournal_article
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
Matchvolume
Modejournal_article
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="volume" mode="journal_article">
+  <xsl:element name="detail" namespace="http://www.loc.gov/mods/v3">
+    <xsl:attribute name="type">volume</xsl:attribute>
+    <xsl:element name="number" namespace="http://www.loc.gov/mods/v3">
+      <xsl:value-of select="."/>
+    </xsl:element>
+  </xsl:element>
+</xsl:template>
+
+
+
+
Template + datejournal_article
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
Matchdate
Modejournal_article
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template mode="journal_article" match="date">
+  <xsl:element name="date" namespace="http://www.loc.gov/mods/v3">
+    <xsl:value-of select="."/>
+  </xsl:element>
+</xsl:template>
+
+
+
+
Template + locationbook
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
Matchlocation
Modebook
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="location" mode="book">
+  <xsl:element name="place" namespace="http://www.loc.gov/mods/v3">
+    <xsl:element name="placeTerm" namespace="http://www.loc.gov/mods/v3">
+      <xsl:attribute name="type">text</xsl:attribute>
+      <xsl:value-of select="."/>
+    </xsl:element>
+  </xsl:element>
+</xsl:template>
+
+
+
+
Template + datebook
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
Matchdate
Modebook
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="date" mode="book">
+  <xsl:element name="dateIssued" namespace="http://www.loc.gov/mods/v3">
+    <xsl:value-of select="."/>
+  </xsl:element>
+</xsl:template>
+
+
+
+
Template + publisherbook
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
Matchpublisher
Modebook
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="publisher" mode="book">
+  <xsl:element name="publisher" namespace="http://www.loc.gov/mods/v3">
+    <xsl:value-of select="."/>
+  </xsl:element>
+</xsl:template>
+
+
+
+
Template + notesjournal_article
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
Matchnotes
Modejournal_article
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template match="notes" mode="journal_article">
+  <xsl:comment>
+    <xsl:value-of select="."/>
+  </xsl:comment>
+</xsl:template>
+
+
+
+
Template + locationjournal_article
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
Matchlocation
Modejournal_article
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template mode="journal_article" match="location"/>
+
+
+
+
Template + titlejournal_article
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + +
NamespaceNo namespace
Matchtitle
Modejournal_article
Import precedence0
+
Source
+
+
+
+ + + + +
<xsl:template mode="journal_article" match="title"/>
+
+
+
+
Output + (default)
+ + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + +
+
Documentation
+
+
+
<xd:author>Matteo Romanello</xd:author>
+
NamespaceNo namespace
+
Output properties
+
+
+
+ + + + + + + + + + + +
methodindent
xmlyes
+
+
+
Source
+
+
+
+ + + + +
<xsl:output method="xml" indent="yes"/>
+
+
+
+ + + \ No newline at end of file diff --git a/bin/W00-0102.body b/bin/W00-0102.body deleted file mode 100644 index 1d32488..0000000 --- a/bin/W00-0102.body +++ /dev/null @@ -1,4086 +0,0 @@ -Using -Long -Runs -as -Predictors -of -Semantic -Coherence -in -a -Partial -Document -Retrieval -System -Hyopil -Shin -Computing -Research -Laboratory, -NMSU -PO -Box -30001 -Las -Cruces, -NM, -88003 -hshin@crl.nmsu.edu -Jerrold -F. -Stach -Computer -Science -Telecommunications, -UMKC -5100 -Rockhill -Road -Kansas -City, -MO, -64110 -stach@cstp.umkc.edu -Abstract -We -propose -a -method -for -dealing -with -semantic -complexities -occurring -in -information -retrieval -systems -on -the -basis -of -linguistic -observations. -Our -method -follows -from -an -analysis -indicating -that -long -runs -of -content -words -appear -in -a -stopped -document -cluster, -and -our -observation -that -these -long -runs -predominately -originate -from -the -prepositional -phrase -and -subject -complement -positions -and -as -such, -may -be -useful -predictors -of -semantic -coherence. -From -this -linguistic -basis, -we -test -three -statistical -hypotheses -over -a -small -collection -of -documents -from -different -genre. -By -coordinating -thesaurus -semantic -categories -(SEMCATs) -of -the -long -run -words -to -the -semantic -categories -of -paragraphs, -we -conclude -that -for -paragraphs -containing -both -long -runs -and -short -runs, -the -SEMCAT -weight -of -long -runs -of -content -words -is -a -strong -predictor -of -the -semantic -coherence -of -the -paragraph. -Introduction -One -of -the -fundamental -deficiencies -of -current -information -retrieval -methods -is -that -the -words -searchers -use -to -construct -terms -often -are -not -the -same -as -those -by -which -the -searched -information -has -been -indexed. -There -are -two -components -to -this -problem, -synonymy -and -polysemy -(Deerwester -et. -al., -1990). 
-By -definition -of -polysemy, -a -document -containing -the -search -terms -or -indexed -with -the -search -terms -is -not -necessarily -relevant. -Polysemy -contributes -heavily -to -poor -precision. -Attempts -to -deal -with -the -synonymy -problem -have -relied -on -intellectual -or -automatic -term -expansion, -or -the -construction -of -a -thesaurus. -Also -the -ambiguity -of -natural -language -causes -semantic -complexities -that -result -in -poor -precision. -Since -queries -are -mostly -formulated -as -words -or -phrases -in -a -language, -and -the -expressions -of -a -language -are -ambiguous -in -many -cases, -the -system -must -have -ways -to -disambiguate -the -query. -In -order -to -resolve -semantic -complexities -in -information -retrieval -systems, -we -designed -a -method -to -incorporate -semantic -information -into -current -IR -systems. -Our -method -( -1 -) -adopts -widely -used -Semantic -Information -or -Categories, -(2) -calculates -Semantic -Weight -based -on -probability, -and -(3) -(for -the -purpose -of -verifying -the -method) -performs -partial -text -retrieval -based -upon -Semantic -Weight -or -Coherence -to -overcome -cognitive -overload -of -the -human -agent. -We -make -two -basic -assumptions: -1. -Matching -search -terms -to -semantic -categories -should -improve -retrieval -precision. -2. -Long -runs -of -content -words -have -a -linguistic -basis -for -Semantic -Weight -and -can -also -be -verified -statistically. -1 -A -Brief -Overview -of -Previous -Approaches -There -have -been -several -attempts -to -deal -with -complexity -using -semantic -information. -These -methods -are -hampered -by -the -lack -of -dictionaries -containing -proper -semantic -categories -for -classifying -text. -Semantic -methods -designed -by -Boyd -et. -al. -(1994) -and -Wendlandt -et. -al. -(1991) -demonstrate -only -simple -examples -and -are -restricted -to -small -numbers -of -words. 
-In -order -to -overcome -this -6 -deficiency, -we -propose -to -incorporate -the -structural -information -of -the -thesaurus, -semantic -categories -(SEMCATs). -However, -we -must -also -incorporate -semantic -categories -into -current -IR -systems -in -a -compatible -manner. -The -problem -we -deal -with -is -partial -text -retrieval -when -all -the -terms -of -the -traditional -vector -equations -are -not -known. -This -is -the -case -when -retrieval -is -associated -with -a -near -real -time -filter, -or -when -the -size -or -number -of -documents -in -a -corpus -is -unknown. -In -such -cases -we -can -retrieve -only -partial -text, -a -paragraph -or -page. -But -since -there -is -no -document -wide -or -corpus -wide -statistics, -it -is -difficult -to -judge -whether -or -not -the -text -fragment -is -relevant. -The -method -we -employ -in -this -paper -identifies -semantic -"hot -spots" -in -partial -text. -These -"hot -spots" -are -loci -of -semantic -coherence -in -a -paragraph -of -text. -Such -paragraphs -are -likely -to -convey -the -central -ideas -of -the -document, -We -also -deal -with -the -computational -aspects -of -partial -text -retrieval. -We -use -a -simple -stop/stem -method -to -expose -long -runs -of -context -words -that -are -evaluated -relative -to -the -search -terms. -Our -goal -is -not -to -retrieve -a -highly -relevant -sentence, -but -rather -to -retrieve -a -portion -of -text -that -is -semantically -coherent -with -respect -to -the -search -terms. -This -locale -can -be -returned -to -the -searcher -for -evaluation -and -if -it -is -relevant, -the -search -terms -can -be -refined. -This -approach -is -compatible -with -Latent -Semantic -Indexing -(LSI) -for -partial -text -retrieval -when -the -terms -of -the -vector -space -are -not -known. -LSI -is -based -on -a -vector -space -information -retrieval -method -that -has -demonstrated -improved -performance -over -the -traditional -vector -space -techniques. 
-So -when -incorporating -semantic -information, -it -is -necessary -to -adopt -existing -mathematical -methods -including -probabilistic -methods -and -statistical -methods. -2 -Theoretical -Background -2.1 -Long -Runs -Partial -Information -Retrieval -has -to -with -detection -of -main -ideas. -Main -ideas -are -topic -sentences -that -have -central -meaning -to -the -text. -Our -method -of -detecting -main -idea -paragraphs -extends -from -Jang -(1997) -who -observed -that -after -stemming -and -stopping -a -document, -long -runs -of -content -words -cluster. -Content -word -runs -are -a -sequence -of -content -words -with -a -function -word(s) -prefix -and -suffix. -These -runs -can -be -weighted -for -density -in -a -stopped -document -and -vector -processed. -We -observed -that -these -long -content -word -runs -generally -originate -from -the -prepositional -phrase -and -subject -complement -positions, -providing -a -linguistic -basis -for -a -dense -neighbourhood -of -long -runs -of -content -words -signalling -a -semantic -locus -of -the -writing. -We -suppose -that -these -neighbourhoods -may -contain -main -ideas -of -the -text. -In -order -to -verify -this, -we -designed -a -methodology -to -incorporate -semantic -features -into -information -retrieval -and -examined -long -runs -of -content -words -as -a -semantic -predictor. -We -examined -all -the -long -runs -of -the -Jang -(1997) -collection -and -discovered -most -of -them -originate -from -the -prepositional -phrase -and -subject -complement -positions. -According -to -Halliday -(1985), -a -preposition -is -explained -as -a -minor -verb. -It -functions -as -a -minor -Predicator -having -a -nominal -group -as -its -complement. -Thus -the -internal -structure -of -'across -the -lake' -is -like -that -of -'crossing -the -lake', -with -a -non-finite -verb -as -Predicator -(thus -our -choice -of -3 -words -as -a -long -run). 
-When -we -interpret -the -preposition -as -a -"minor -Predicator" -and -"minor -Process", -we -are -interpreting -the -prepositional -phrase -as -a -kind -of -minor -clause. -That -is, -prepositional -phrases -function -as -a -clause -and -their -role -is -predication. -Traditionally, -predication -is -what -a -statement -says -about -its -subject. -A -named -predication -corresponds -to -an -externally -defined -function, -namely -what -the -speaker -intends -to -say -his -or -her -subject, -i.e. -their -referent. -If -long -runs -largely -appear -in -predication -positions, -it -would -suggest -that -the -speaker -is -saying -something -important -and -the -longer -runs -of -content -words -would -signal -a -locus -of -the -speaker's -intention. -Extending -from -the -statistical -analysis -of -Jang -(1997) -and -our -observations -of -those -long -runs -in -the -collection, -we -give -a -basic -assumption -of -OUT -study: -Long -runs -of -content -words -contain -significant -semantic -information -that -a -speaker -wants -to -express -and -focus, -and -thus -are -semantic -indicators -or -loci -or -main -ideas. -7 -In -this -paper, -we -examine -the -SEMCAT -values -of -long -and -short -runs, -extracted -from -a -random -document -of -the -collection -in -Jang -(1997), -to -determine -if -the -SEMCAT -weights -of -long -runs -of -content -words -are -semantic -predictors. -2.2 -SEMCATs -We -adopted -Roget's -Thesaurus -for -our -basic -semantic -categories -(SEMCATs). -We -extracted -the -semantic -categories -from -the -online -Thesaurus -for -convenience. -We -employ -the -39 -intermediate -categories -as -basic -semantic -information, -since -the -6 -main -categories -are -too -general, -and -the -many -sub-categories -are -too -narrow -to -be -taken -into -account. -We -refer -to -these -39 -categories -as -SEMCATs. 
-Table -1: -Semantic -Categories -(SEMCATs) -Abbreviation -Full -Description -1 -AFIG -Affection -in -General -2 -ANT -Antagonism -3 -CAU -Causation -4 -CHN -Change -5 -COIV -Conditional -Intersocial -Volition -6 -CRTH -Creative -Thought -7 -DIM -Dimensions -EXIS -Existence -9 -EXOT -Extension -of -Thought -1° -FORM -Form -11 -GINV -General -Inter -social -Volition -12 -INOM -Inorganic -Matter -13 -MECO -Means -of -Communication -14 -MFRE -Materials -for -Reasoning -15 -MIG -Matter -ingeneral -16 -MOAF -Moral -Affections -17 -MOCO -Modes -of -Communication -18 -MOT -Motion -19 -NOIC -Nature -of -Ideas -Communicated -20 -NUM -Number -21 -opm -Operations -of -Intelligence -In -General -22 -ORD -Order -23 -ORGM -Organic -Matter -24 -pEAF -Personal -Affections -25 -PORE -Possessive -Relations -26 -PRCO -Precursory -Conditions -and -Operations -27 -PRVO -Prospective -Volition -28 -QUAN -Quantity -29 -REAF -Religious -Affections -ao -RELN -Relation -31 -REOR -Reasoning -Organization -32 -REPR -Reasoning -Process -33 -ROVO -Result -of -Voluntary -Action -34 -SIG -Space -in -General -35 -S -IVO -Special -Inter -social -Volition -36 -SYAF -Sympathetic -Affections -37 -TIME -Time -38 -VOAC -Voluntary -Action -39 -VOIG -Volition -in -General -2.3 -Indexing -Space -and -Stop -Lists -Many -of -the -most -frequently -occurring -words -in -English, -such -as -"the," -"of," -"and," -"to," -etc. -are -non-discriminators -with -respect -to -information -filtering. -Since -many -of -these -function -words -make -up -a -large -fraction -of -the -text -of -Most -documents, -their -early -elimination -in -the -indexing -process -speeds -processing, -saves -significant -amounts -of -index -space -and -does -not -compromise -the -filtering -process. -In -the -Brown -Corpus, -the -frequency -of -stop -words -is -551,057 -out -of -1,013,644 -total -words. -Function -words -therefore -account -for -about -54.5% -of -the -tokens -in -a -document. 
-The -Brown -Corpus -is -useful -in -text -retrieval -because -it -is -small -and -efficiently -exposes -content -word -runs. -Furthermore, -minimizing -the -document -token -size -is -very -important -in -NLP- -based -methods, -because -NLP-based -methods -usually -need -much -larger -indexing -spaces -than -statistical-based -methods -due -to -processes -for -tagging -and -parsing. -3 -Experimental -Basis -In -order -to -verify -that -long -runs -contribute -to -resolve -semantic -complexities -and -can -be -used -as -predictors -of -semantic -intent, -we -employed -a -probabilistic, -vector -processing -methodology. -3.1 -Revised -Probability -and -Vector -Processing -In -order -to -understand -the -calculation -of -SEMCATs, -it -is -helpful -to -look -at -the -structure -8 -of -a -preprocessed -document. -One -document -"Barbie" -in -the -Jang -(1997) -collection -has -a -total -of -1,468 -words -comprised -of -755 -content -words -and -713 -function -words. -The -document -has -17 -paragraphs. -Filtering -out -function -words -using -the -Brown -Corpus -exposed -the -runs -of -content -words -as -shown -in -Figure -1. -Figure -1: -Preprocessed -Text -Document -BARBIE -* -* -* -* -FAVORITE -COMPANION -DETRACTORS -LOVE -* -* -* -PLASTIC -PERFECTION -* -FASHION -DOLL -* -* -IMPOSSIBLE -FIGURE -* -LONG -* -* -* -POPULAR -GIRL -* -MA -ITEL -* -WORLD -* -TOYMAKER -* -PRODUCTS -RANGE -* -FISHER -PRICE -INFANT -* -SALES -* -* -* -TALL -MANNEQUIN -* -BARBIE -* -* -AGE -* -* -* -BEST -SELLING -GIRLS -BRAND -* -* -POISED -* -STRUT -* -* -CHANGE -* -* -MALE -DOMINATED -WORLD -* -MULTIMEDIA -SOFTWARE -* -VIDEO -GAMES -In -Figure -1, -asterisks -occupy -positions -where -function -words -were -filtered -out. -The -bold -type -indicates -the -location -of -the -longest -runs -of -content -words. 
-The -run -length -distribution -of -Figure -1 -is -shown -below: -Table -2: -Distribution -of -Content -Run -Lengths -in -a -sam -le -Document -Run -Length -Frequency -1 -II -2 -8 -3 -2 -4 -2 -The -traditional -vector -processing -model -requires -the -following -set -of -terms: -• -(dl) -the -number -of -documents -in -the -collection -that -each -word -occurs -in -• -(id° -the -inverse -document -frequency -of -each -word -determined -by -logio(N/df) -where -N -is -the -total -number -of -documents. -If -a -word -appears -in -a -query -but -not -in -a -document, -its -idf -is -undefined. -• -The -category -probability -of -each -query -word. -Wendlandt -(1991) -points -out -that -it -is -useful -to -retrieve -a -set -of -documents -based -upon -key -words -only, -and -then -considers -only -those -documents -for -semantic -category -and -attribute -analysis. -Wendlandt -(1991) -appends -the -s -category -weights -to -the -t -term -weights -of -each -document -vector -Di -and -the -Query -vector -Q. -Since -our -basic -query -unit -is -a -paragraph, -document -frequency -(dl) -and -inverse -document -frequency -(idf) -have -to -be -redefined. -As -we -pointed -out -in -Section -1, -all -terms -are -not -known -in -partial -text -retrieval. -Further, -our -approach -is -based -on -semantic -weight -rather -than -word -frequency. -Therefore -any -frequency -based -measures -defined -by -Boyd -et -al. -(1994) -and -Wendlandt -(1991) -need -to -be -built -from -the -probabilities -of -individual -semantic -categories. -Those -modifications -are -described -below. -As -a -simplifying -assumption, -we -assume -SEMCATs -have -a -uniform -probability -distribution -with -regard -to -a -word. -3.2 -Calculating -SEMCATs -Our -first -task -in -computing -SEMCAT -values -was -to -create -a -SEMCAT -dictionary -for -our -method. -We -extracted -SEMCATs -for -every -word -from -the -World -Wide -Web -version -of -Roget's -thesaurus. 
-SEMCATs -give -probabilities -of -a -word -corresponding -to -a -semantic -category. -The -content -word -run -'favorite -companion -detractors -love' -is -of -length -4. -Each -word -of -the -run -maps -to -at -least -one -SEMCAT. -The -word -`favorite' -maps -to -categories -`PEAF -and -SYAF'. -'companion' -maps -to -categories -'ANT, -MECO, -NUM, -ORD, -ORGM, -PEAF, -PRVO, -QUAN, -and -SYAF'. -'detractor' -maps -to -`MOAF'. -'love' -maps -to -`AFIG, -ANT, -MECO, -MOAF, -MOCO, -ORGM, -PEAF, -PORE, -PRVO, -SYAF, -and -VOIG'. -We -treat -the -long -runs -as -a -semantic -core -from -which -to -calculate -SEMCAT -values. -SEMCAT -weights -are -calculated -based -on -the -following -equations. -Eq.1 -Pik(Probability) -- -The -likelihood -of -SEMCAT -Si -occurring -due -to -the -le -trigger. -For -example, -assuming -a -uniform -probability -distribution, -the -category -PEAF -triggered -by -the -word -favorite -above, -has -the -following -probability: -PPEAF, -favorite -= -0.5(112) -Eq.2 -Sw; -(SEMCAT -Weights -in -Long -runs) -is -the -sum -of -each -SEMCATO -weight -of -long -runs -based -on -their -probabilities. -In -the -above -example, -the -long -run -9 -'favorite -companion -detractors -love,' -the -SEMCAT -`MOAF' -has -SWMOAF -(detractor(1) -love(.09)) -= -1.09. -We -can -write; -SWi -= -I -p,, -Eq.3 -edwj -(Expected -data -weights -in -a -paragraph) -- -Given -a -set -of -N -content -words -(data) -in -a -paragraph, -the -expected -weight -of -the -SEMCATs -of -long -runs -in -a -paragraph -is: -edwj -= -pi; -,=1 -Eq.4 -idwj -(Inverse -data -weights -in -a -paragraph) -- -The -inverse -data -weight -of -SEMCATs -of -long -runs -for -a -set -of -N -content -words -in -a -paragraph -is -N -), -ichvi=logio((- -edwi -Eq.5 -Weight(W) -- -The -weight -of -SEMCAT -Si -in -a -paragraph -is -W; -= -Swjxidw; -Eq.6 -Relevance -Weights -(Semantic -Coherence) -Our -method -performs -the -following -steps: -1. 
-calculate -the -SEMCAT -weight -of -each -long -content -word -run -in -every -paragraph -(Sw) -2. -calculate -the -expected -data -weight -of -each -paragraph -(edw) -3. -calculate -the -inverse -expected -data -weight -of -each -paragraph -(idw) -4. -calculate -the -actual -weight -of -each -paragraph -(Swxidw) -5. -calculate -coherence -weights -(total -relevance) -by -summing -the -weights -of -(Swxidw). -In -every -paragraph, -extraction -of -SEMCATs -from -long -runs -is -done -first. -The -next -step -is -finding -the -same -SEMCATs -of -long -runs -through -every -word -in -a -paragraph -(expected -data -weight), -then -calculate -idw, -and -finally -Swxidw. -The -final, -total -relevance -weights -are -an -accumulation -of -all -weights -of -SEMCATs -of -content -words -in -a -paragraph. -Total -relevance -tells -how -many -SEMCATs -of -the -Query's -long -runs -appear -in -a -paragraph. -Higher -values -imply -that -the -paragraph -is -relevant -to -the -long -runs -of -the -Query. 
-The -following -is -a -program -output -for -calculating -SEMCAT -weights -for -an -arbitrary -long -run: -"SEVEN -INTERACTIVE -PRODUCTS -LED" -SEMCAT: -EXOT -Sw -: -1.00 -edw -: -1.99 -idw -: -1.44 -Swxidw -: -1.44 -SEMCAT: -GINV -Sw -: -0.33 -edw -: -1.62 -idw -: -1.53 -Swxidw -: -0.51 -SEMCAT: -MOT -Sw -: -0.20 -edw -: -0.71 -idw -: -1.89 -Swxidw -: -0.38 -SEMCAT: -NUM -Sw -: -0.20 -edw -: -1.76 -idw -: -1.49 -Swxidw -: -0.30 -SEMCAT: -ORGM -Sw -: -0.20 -edw -: -1.67 -idw -1.52 -Swxidw -; -0,30 -SEMCAT: -PEAF -Sw -: -0.53 -edw -: -1.50 -idw -: -1.56 -Swxidw -: -0.83 -SEMCAT: -REAF -Sw -: -0.20 -edw -: -0.20 -idw -: -2.44 -Swxidw -: -0.49 -SEMCAT: -SYAF -Sw -: -0.33 -edw -: -1.19 -idw -: -1.66 -Swxidw -: -0.55 -Total -(Swxidw) -: -4,79 -4 -Experimental -Results -The -goal -of -employing -probability -and -vector -processing -is -to -prove -the -linguistic -basis -that -long -runs -of -content -words -can -be -used -as -predictors -of -semantic -intent -But -we -also -want -to -exploit -the -computational -advantage -of -removing -the -function -words -from -the -document, -which -reduces -the -number -of -tokens -processed -by -about -50% -and -thus -reduces -vector -space -and -probability -computations. -If -it -is -true -that -long -runs -of -content -words -are -predictors -of -semantic -coherence, -we -can -further -reduce -the -complexity -of -vector -computations: -(1) -by -eliminating -those -paragraphs -without -long -runs -from -consideration, -(2) -within -remaining -paragraphs -with -long -runs, -computing -and -summing -the -semantic -coherence -of -the -longest -runs -only, -(3) -ranking -the -eligible -paragraphs -for -retrieval -based -upon -their -semantic -weights -relative -to -the -query. -Jang -(1997) -established -that -the -distribution -of -long -runs -of -content -words -and -short -runs -of -content -words -in -a -collection -of -paragraphs -are -drawn -from -different -populations. 
-This -implies -10 -that -either -long -runs -or -short -runs -are -predictors, -but -since -all -paragraphs -contain -short -runs, -i.e. -a -single -content -word -separated -by -function -words, -only -long -runs -can -be -useful -predictors. -Furthermore, -only -long -runs -as -we -define -them -can -be -used -as -predictors -because -short -runs -are -insufficient -to -construct -the -language -constructs -for -prepositional -phrase -and -subject -complement -positions. -If -short -runs -were -discriminators, -the -linguistic -assumption -of -this -research -would -be -violated. -The -statistical -analysis -of -Jang -(1997) -does -not -indicate -this -to -be -the -case. -To -proceed -in -establishing -the -viability -of -our -approach, -we -proposed -the -following -experimental -hypotheses: -(111) -The -SEMCAT -weights -for -long -runs -of -content -words -are -statistically -greater -than -weights -for -short -runs -of -content -words. -Since -each -content -word -can -map -to -multiple -SEMCATs, -we -cannot -assume -that -the -semantic -weight -of -a -long -run -is -a -function -of -its -length. -The -semantic -coherence -of -long -runs -should -be -a -more -granular -discriminator. -(112) -For -paragraphs -containing -long -runs -and -short -runs, -the -distribution -of -long -run -SEMCAT -weights -is -statistically -different -from -the -distribution -of -short -run -SEMCAT -weights. -(H3) -There -is -a -positive -correlation -between -the -sum -of -long -run -SEMCAT -weights -and -the -semantic -coherence -of -a -paragraph, -the -total -paragraph -SEMCAT -weight. -A -detailed -description -of -these -experiments -and -their -outcome -are -described -in -Shin -(1997, -1999). -The -results -of -the -experiments -and -the -implications -of -those -results -relative -to -the -method -we -propose -are -discussed -below. 
-Table -3 -gives -the -SEMCAT -weights -for -seventeen -paragraphs -randomly -chosen -from -one -document -in -the -collection -of -Jang -(1997). -Table -3: -SEMCAT -Weights -of -17 -Paragraphs -Chosen -Randomly -From -a -Collection -Paragraph -Short -Runs -Long -Runs -Weight -Weight -1 -29.84 -18.60 -2 -31.29 -12.81 -3 -23.29 -4.25 -4 -23.94 -11.63 -5 -34.63 -35.00 -6 -22.85 -03.32 -7 -21.74 -00.00 -8 -35.84 -15.94 -9 -30.15 -00.00 -10 -13.40 -00.00 -11 -23.01 -07.82 -12 -31.69 -04.79 -13 -36.54 -00.00 -14 -17.91 -10.55 -15 -19.70 -05.83 -16 -17.11 -00.00 -17 -31.86 -00.00 -The -data -was -evaluated -using -a -standard -two -way -F -test -and -analysis -of -variance -table -with -ot -= -.05. -The -analysis -of -variance -table -for -the -paragraphs -in -Table -3 -is -shown -in -Table -4. -Table -4: -Analysis -of -Variance -for -Table -2 -Data -Variation -Degrees -of -Mean -Square -F -Freedom -Between -1 -2904.51 -68.56 -Treatments -V, -= -2904.51 -Between -Blocks -16 -93.92 -2.21 -yr -= -1502.83 -Residual -or -16 -42.36 -Random -V,= -677.77 -Total -33 -V -= -5085.11 -At -the -.05 -significance -level, -Fa -05 -= -4.49 -for -1,16 -degrees -of -freedom. -Since -68.56 -> -4.49 -we -reject -the -assertion -that -column -means -(run -weights) -are -equal -in -Table -2. -Long -run -and -short -run -weights -come -from -different -populations. -We -accept -Hl. -For -the -between -paragraph -treatment, -the -row -means -(paragraph -weights) -have -an -F -value -of -2.21. -At -the -.05 -significance -level, -Fa -. -05 -= -2.28 -for -16,16 -degrees -of -freedom. -Since -2.21 -< -2.28 -we -cannot -reject -the -assertion -that -there -is -no -significant -difference -in -SEMCAT -weights -between -paragraphs. -That -is, -paragraph -weights -do -not -appear -to -be -taken -from -different -populations, -as -do -the -long -run -and -short -run -weight -distributions. 
-Thus, -the -semantic -weight -11 -of -the -content -words -in -a -paragraph -cannot -be -used -to -predict -the -semantic -weight -of -the -paragraph. -We -therefore -proceed -to -examine -H2. -Notice -that -two -paragraphs -in -Table -2 -are -without -long -runs. -We -need -to -repeat -the -analysis -of -variance -for -only -those -paragraphs -with -long -runs -to -see -if -long -runs -are -discriminators. -Table -5 -summarizes -those -paragraphs. -Table -5: -SEMCAT -weights -of -11 -paragraphs -containing -Ion -runs -and -short -runs -Paragraph -Short -Runs -Long -Runs -Weight -Weight -1 -29.84 -18.60 -2 -31.29 -12.81 -3 -23.29 -4.25 -4 -23.94 -11,63 -5 -34.63 -35.00 -6 -22.85 -03.32 -8 -35.84 -15.94 -11 -23.01 -07.82 -12 -31.69 -04.79 -14 -17.91 -10.55 -15 -19.70 -05.83 -This -data -was -evaluated -using -a -standard -two -way -F -test -and -analysis -of -variance -with -a -= -.05. -The -analysis -of -variance -table -for -the -paragraphs -in -Table -5 -follows. -Table -6: -Analysis -of -Variance -for -Table -5 -Data -Variation -._ -Mean -Square -F -Degrees -of -Freedom -Between -Treatments -1 -1430.98 -291.44 -V= -1430.98 -Between -Blocks -10 -94.40 -19.22 -V= -944.05 -Residual -or -10 -4.91 -Random -V,...- -49.19 -Total -21 -V -= -2424.26 -At -the -.05 -significance -level, -F. -.05 -= -4.10 -for -2,10 -degrees -of -freedom. -4.10 -< -291.44. -At -the -.05 -significance -level, -F. -= -2.98 -for -10,10 -degrees -of -freedom. -2.98 -< -19.22. -For -paragraphs -in -a -collection -containing -both -long -and -short -runs: -the -SEMCAT -weights -of -the -long -runs -and -short -runs -are -drawn -from -different -distributions. -We -accept -112. -For -paragraphs -containing -long -runs -and -short -runs, -the -distributions -of -long -run -SEMCAT -weights -is -different -from -the -distribution -of -short -run -SEMCAT -weights. -We -know -from -the -linguistic -basis -for -long -runs -that -short -runs -cannot -be -used -as -predictors. 
-We -therefore -proceed -to -examine -the -Pearson -correlation -between -the -long -run -SEMCAT -weights -and -paragraph -SEMCAT -weights -for -those -paragraphs -with -both -long -and -short -content -word -runs. -Table -7: -Correlation -of -Long -Run -SEMCAT -Wei -hts -to -Para -ra -h -SEMCAT -Weight -Paragraph -Long -Runs -Semantic -Weight -Paragraph -Semantic -Weight -1 -18.60 -48.44 -2 -12.81 -44.10 -3 -4.25 -27.54 -4 -11.63 -35.57 -5 -35.00 -69.63 -6 -03.32 -26.17 -8 -15.94 -51.78 -11 -07.82 -30.83 -12 --04.79 -31.69 -14 -10.55 -28.46 -15 -05.83 -25.53 -The -weights -in -Table -have -a -positive -Pearson -Product -Correlation -coefficient -of -.952. -We -therefore -accept -1-13. -There -is -a -positive -correlation -between -the -sum -of -long -run -SEMCAT -weights -and -the -semantic -coherence -of -a -paragraph, -the -total -paragraph -SEMCAT -weight. -5. -Conclusion -This -research -tested -three -statistical -hypotheses -extending -from -two -observations: -(1) -fang -(1997) -observed -the -clustering -of -long -runs -of -content -words -and -established -the -distribution -of -long -run -lengths -and -short -run -lengths -are -drawn -from -different -populations, -(2) -our -observation -that -these -long -runs -of -content -words -originate -from -the -prepositional -phrase -and -subject -complement -positions. -According -to -Halliday -(1985) -those -grammar -structures -function -as -12 -minor -predication -and -as -such -are -loci -of -semantic -intent -or -coherence. -In -order -to -facilitate -the -use -of -long -runs -as -predictors, -we -modified -the -traditional -measures -of -Boyd -et -al. -(1994), -Wendlandt -(1991) -to -accommodate -semantic -categories -and -partial -text -retrieval. -The -revised -metrics -and -the -computational -method -we -propose -were -used -in -the -statistical -experiments -presented -above. -The -main -findings -of -this -work -are -1. 
-the -distribution -semantic -coherence -(SEMCAT -weights) -of -long -runs -is -not -statistically -greater -than -that -of -short -runs, -2. -for -paragraphs -containing -both -long -runs -and -short -runs, -the -SEMCAT -weight -distributions -are -drawn -from -different -populations -3. -there -is -a -positive -correlation -between -the -sum -of -long -run -SEMCAT -weights -and -the -total -SEMCAT -weight -of -the -paragraph -(its -semantic -coherence). -Significant -additional -work -is -required -to -validate -these -preliminary -results. -The -collection -employed -in -Jang -(1997) -is -not -a -standard -Corpus -so -we -have -no -way -to -test -precision -and -relevance -of -the -proposed -method. -The -results -of -the -proposed -method -are -subject -to -the -accuracy -of -the -stop -lists -and -filtering -function. -Nonetheless, -we -feel -the -approach -proposed -has -potential -to -improve -performance -through -reduced -token -processing -and -increased -relevance -through -consideration -of -semantic -coherence -of -long -runs. -Significantly, -our -approach -does -not -require -knowledge -of -the -collection. -References diff --git a/bin/W00-0102.old b/bin/W00-0102.old deleted file mode 100644 index 4846cb8..0000000 --- a/bin/W00-0102.old +++ /dev/null @@ -1,145 +0,0 @@ - - - - -Using Long Runs as Predictors of Semantic Coherence in a Partial Document Retrieval System -Hyopil Shin -Computing Research Laboratory, NMSU -
PO Box 30001 Las Cruces, NM, 88003
-hshin@crl.nmsu.edu -Jerrold F Stach -Computer Science Telecommunications, UMKC -
5100 Rockhill Road Kansas City, MO, 64110
-stach@cstp.umkc.edu -We propose a method for dealing with semantic complexities occurring in information retrieval systems on the basis of linguistic observations. Our method follows from an analysis indicating that long runs of content words appear in a stopped document cluster, and our observation that these long runs predominately originate from the prepositional phrase and subject complement positions and as such, may be useful predictors of semantic coherence. From this linguistic basis, we test three statistical hypotheses over a small collection of documents from different genre. By coordinating thesaurus semantic categories (SEMCATs) of the long run words to the semantic categories of paragraphs, we conclude that for paragraphs containing both long runs and short runs, the SEMCAT weight of long runs of content words is a strong predictor of the semantic coherence of the paragraph -
-
- - - - -R Boyd -J Driscoll -I Syu - -incorporating Semantics Within a Connectionist Model and a Vector Processing Model -1994 -In Proceedings of the TREC-2 -NIST. - -ed out in Section 1, all terms are not known in partial text retrieval. Further, our approach is based on semantic weight rather than word frequency. Therefore any frequency based measures defined by Boyd et al. (1994) and Wendlandt (1991) need to be built from the probabilities of individual semantic categories. Those modifications are described below. As a simplifying assumption, we assume SEMCATs have a uniform -ar structures function as 12 minor predication and as such are loci of semantic intent or coherence. In order to facilitate the use of long runs as predictors, we modified the traditional measures of Boyd et al. (1994), Wendlandt (1991) to accommodate semantic categories and partial text retrieval. The revised metrics and the computational method we propose were used in the statistical experiments presented above. - -Boyd, Driscoll, Syu, 1994 -Boyd R., Driscoll J, and Syu I. (1994) incorporating Semantics Within a Connectionist Model and a Vector Processing Model. In Proceedings of the TREC-2, NIST. - - - -S Deerwester -G Furnas -T Landauer -R Harshman - -Indexing by Latent Semantic Anaysis -1990 -Journal of the American Society of Information Science -41--6 -Deerwester, Furnas, Landauer, Harshman, 1990 -Deerwester S., Furnas G., Landauer T., and Harshman R. (1990) Indexing by Latent Semantic Anaysis. Journal of the American Society of Information Science 41-6. - - - -M A K Halliday - -An Introduction to Functional Grammar. Edward -1985 -Arnold, London - -as a semantic predictor. We examined all the long runs of the Jang (1997) collection and discovered most of them originate from the prepositional phrase and subject complement positions. According to Halliday (1985), a preposition is explained as a minor verb. It functions as a minor Predicator having a nominal group as its complement. 
Thus the internal structure of 'across the lake' is like that of 'crossing th -hort run lengths are drawn from different populations, (2) our observation that these long runs of content words originate from the prepositional phrase and subject complement positions. According to Halliday (1985) those grammar structures function as 12 minor predication and as such are loci of semantic intent or coherence. In order to facilitate the use of long runs as predictors, we modified the traditional - -Halliday, 1985 -Halliday M.A.K. (1985) An Introduction to Functional Grammar. Edward Arnold, London. - - - -S Jang - -Extracting Context from Unstructured Text Documents by Content Word Density -1997 -M.S. Thesis -University of Missouri-Kansas City - -Runs Partial Information Retrieval has to with detection of main ideas. Main ideas are topic sentences that have central meaning to the text. Our method of detecting main idea paragraphs extends from Jang (1997) who observed that after stemming and stopping a document, long runs of content words cluster. Content word runs are a sequence of content words with a function word(s) prefix and suffix. These runs c -erify this, we designed a methodology to incorporate semantic features into information retrieval and examined long runs of content words as a semantic predictor. We examined all the long runs of the Jang (1997) collection and discovered most of them originate from the prepositional phrase and subject complement positions. According to Halliday (1985), a preposition is explained as a minor verb. It functions -tions, it would suggest that the speaker is saying something important and the longer runs of content words would signal a locus of the speaker's intention. Extending from the statistical analysis of Jang (1997) and our observations of those long runs in the collection, we give a basic assumption of OUT study: Long runs of content words contain significant semantic information that a speaker wants to express -ogy. 
3.1 Revised Probability and Vector Processing In order to understand the calculation of SEMCATs, it is helpful to look at the structure 8 of a preprocessed document. One document &quot;Barbie&quot; in the Jang (1997) collection has a total of 1,468 words comprised of 755 content words and 713 function words. The document has 17 paragraphs. Filtering out function words using the Brown Corpus exposed the runs of co -raphs with long runs, computing and summing the semantic coherence of the longest runs only, (3) ranking the eligible paragraphs for retrieval based upon their semantic weights relative to the query. Jang (1997) established that the distribution of long runs of content words and short runs of content words in a collection of paragraphs are drawn from different populations. This implies 10 that either long ru - -Jang, 1997 -Jang S. (1997) Extracting Context from Unstructured Text Documents by Content Word Density. M.S. Thesis, University of Missouri-Kansas City. - - - -A Moffat -R Davis -R Wilkinson -J Zobel - -Retrieval of Partial Documents -1994 -In Proceedings of TREC-2 -Moffat, Davis, Wilkinson, Zobel, 1994 -Moffat A., Davis R., Wilkinson, R., and Zobel J. (1994) Retrieval of Partial Documents. In Proceedings of TREC-2. - - - -H Shin - -Incorporating Semantic Categories (SEMCATs) into a Partial Information Retrieval System -1997 -M.S. Thesis -University of Missouri Kansas City - -between the sum of long run SEMCAT weights and the semantic coherence of a paragraph, the total paragraph SEMCAT weight. A detailed description of these experiments and their outcome are described in Shin (1997, 1999). The results of the experiments and the implications of those results relative to the method we propose are discussed below. Table 3 gives the SEMCAT weights for seventeen paragraphs randomly - -Shin, 1997 -Shin H. (1997) Incorporating Semantic Categories (SEMCATs) into a Partial Information Retrieval System. M.S. Thesis, University of Missouri Kansas City. 
- - - -H Shin -J Stach - -Incorporating Probabilistic Semantic Categories (SEMCATs) Into Vector Space Techniques for Partial Document Retrieval -1999 -Shin, Stach, 1999 -Shin H., Stach J. (1999) Incorporating Probabilistic Semantic Categories (SEMCATs) Into Vector Space Techniques for Partial Document Retrieval. - - -1999 -Journal of Computer Science and Information Management -2 -to appear - -en the sum of long run SEMCAT weights and the semantic coherence of a paragraph, the total paragraph SEMCAT weight. A detailed description of these experiments and their outcome are described in Shin (1997, 1999). The results of the experiments and the implications of those results relative to the method we propose are discussed below. Table 3 gives the SEMCAT weights for seventeen paragraphs randomly chosen - -1999 -Journal of Computer Science and Information Management, vol. 2, No. 4, December 1999, to appear. - - - -E Wendlandt -R Driscoll - -Incorporating a semantic analysis into a document retrieval strategy -1991 -CACM -31 -54--48 -Wendlandt, Driscoll, 1991 -Wendlandt E. and Driscoll R. (1991) Incorporating a semantic analysis into a document retrieval strategy. CACM 31, pp. 54-48. - - - -
\ No newline at end of file diff --git a/bin/citeExtract.pl b/bin/citeExtract.pl index 70ba4b7..d3ca892 100755 --- a/bin/citeExtract.pl +++ b/bin/citeExtract.pl @@ -43,6 +43,7 @@ =head1 HISTORY my $defaultMode = $PARSCIT; my $defaultInputType = "raw"; my $outputVersion = "100401"; +my $biblioScript ="$FindBin::Bin/BiblioScript/biblio_script.sh"; ### END user customizable section ### Ctrl-C handler @@ -55,13 +56,14 @@ sub quitHandler { sub Help { print STDERR "usage: $progname -h\t\t\t\t[invokes help]\n"; print STDERR " $progname -v\t\t\t\t[invokes version]\n"; - print STDERR " $progname [-qt] [-m ] [-i ] [outfile]\n"; + print STDERR " $progname [-qt] [-m ] [-i ] [-e ] [outfile]\n"; print STDERR "Options:\n"; print STDERR "\t-q\tQuiet Mode (don't echo license)\n"; # Thang v100401: add new mode (extract_section), and -i print STDERR "\t-m \tMode (extract_citations, extract_header, extract_section, extract_meta, extract_all, default: extract_citations)\n"; print STDERR "\t-i \tType (raw, xml, default: raw)\n"; + print STDERR "\t-e \tExport citations into multiple types (ads|bib|end|isi|ris|wordbib). Multiple types can be specified by concatenating with \"-\" e.g., bib-end-ris.
 Output files will be named outfile.exportFormat, where outfile is the output file argument and exportFormat is each individual format supplied with the -e option.\n"; print STDERR "\t-t\tUse token level model instead\n"; print STDERR "\n"; print STDERR "Will accept input on STDIN as a single file.\n"; @@ -88,9 +90,9 @@ sub Version { } $SIG{'INT'} = 'quitHandler'; -getopts ('hqm:i:tv'); +getopts ('hqm:i:e:tv'); -our ($opt_q, $opt_v, $opt_h, $opt_m, $opt_i, $opt_t); +our ($opt_q, $opt_v, $opt_h, $opt_m, $opt_i, $opt_e, $opt_t); # use (!defined $opt_X) for options with arguments if ($opt_v) { Version(); exit(0); } # call Version, if asked for if ($opt_h) { Help(); exit (0); } # call help, if asked for @@ -103,12 +105,42 @@ sub Version { ### Thang v100401: add input type option, and SectLabel ### my $isXmlInput = 0; if(defined $opt_i && $opt_i !~ /^(xml|raw)$/){ - print STDERR "Input type needs to be either \"raw\" or \"xml\"\n"; + print STDERR "#! Input type needs to be either \"raw\" or \"xml\"\n"; Help(); exit (0); } elsif(defined $opt_i && $opt_i eq "xml"){ $isXmlInput = 1; } +### Thang v100901: add export type option & incorporate BibUtils### +my @exportTypes = (); +if(defined $opt_e && $opt_e ne ""){ + # sanity checks + if (($mode & $PARSCIT) != $PARSCIT) { # No call to extract_citation + print STDERR "#! Export type option is only available for the following modes: extract_citations, extract_meta and extract_all\n"; + Help(); exit(0); + } + if(! defined $out){ + print STDERR "#! Export type option requires output file name to be specified\n"; + Help(); exit(0); + } + + # get individual export types + my %typeHash = (); + my @tokens = split(/\-/, $opt_e); + foreach my $token (@tokens) { + if($token !~ /^(ads|bib|end|isi|ris|wordbib)$/){ + print STDERR "#! 
Invalid export type \"$token\"\n"; + Help(); exit (0); + } + + $typeHash{$token} = 1; + } + + # get all export types sorted + @exportTypes = sort {$a cmp $b} keys %typeHash; +} + + my $textFile; if($isXmlInput){ # extracting text from Omnipage XML output $textFile = "/tmp/". newTmpFile(); @@ -133,7 +165,6 @@ sub Version { unlink($sectLabelInput); } } -### End Thang v100401: add input type option, and SectLabel ### if (($mode & $PARSHED) == $PARSHED) { # PARSHED use ParsHed::Controller; @@ -145,6 +176,11 @@ sub Version { use ParsCit::Controller; my $pcXML = ParsCit::Controller::extractCitations($textFile, $isXmlInput); $rXML .= removeTopLines($$pcXML, 1) . "\n"; # remove first line + + # Thang v100901: call to BiblioScript + if(scalar(@exportTypes) != 0){ + biblioScript(\@exportTypes, $$pcXML, $out); + } } $rXML .= "
"; @@ -190,7 +226,7 @@ sub parseMode { } } -# Thang v100401: remove top n lines +# remove top n lines sub removeTopLines { my ($input, $topN) = @_; # remove first line @@ -228,9 +264,44 @@ sub sectLabel { return $$slXML; } -# Thang v100401: method to generate tmp file name +# Thang v100901: incorporate BiblioScript +sub biblioScript { + my ($types, $pcXML, $outFile) = @_; + + my @exportTypes = @{$types}; + my $tmpDir = "/tmp/".newTmpFile(); + system("mkdir -p $tmpDir"); + + # write extract_citation output to a tmp file + my $fileName = "$tmpDir/input.txt"; + open(OF, ">:utf8", $fileName); + print OF "$pcXML"; + close OF; + + # call to BiblioScript + my $size = scalar(@exportTypes); + my $format = $exportTypes[0]; + my $cmd = "$biblioScript -q -i parscit -o $format $fileName $tmpDir"; + system($cmd); + system("mv $tmpDir/parscit.$format $outFile.$format"); + + # reuse the MODS file generated in the first call + for(my $i = 1; $i<$size; $i++){ + $format = $exportTypes[$i]; + $cmd = "$biblioScript -q -i mods -o $format $tmpDir/parscit_mods.xml $tmpDir"; + system($cmd); + system("mv $tmpDir/parscit.$format $outFile.$format"); + } + + #print STDERR "$tmpDir\n"; + system("rm -rf $tmpDir"); +} + +# method to generate tmp file name sub newTmpFile { my $tmpFile = `date '+%Y%m%d-%H%M%S-$$'`; chomp($tmpFile); return $tmpFile; } + + diff --git a/bin/out.txt b/bin/out.txt deleted file mode 100644 index 9fcd418..0000000 --- a/bin/out.txt +++ /dev/null @@ -1,1516 +0,0 @@ - - - - - -Coupling Feature Selection and Machine Learning -Methods for Navigational Query Identification -Yumao Lu Fuchun Peng - - -Xin Li Nawaaz Ahmed - - -Yahoo! Inc. - -
-701 First Avenue -Sunnyvale, California 94089 -
- -fyumaol, fuchun, xinli, nawaazj@yahoo-inc.com - - -ABSTRACT - - -It is important yet hard to identify navigational queries in -Web search due to a lack of sufficient information in Web -queries, which are typically very short. In this paper we -study several machine learning methods, including naive -Bayes model, maximum entropy model, support vector ma- -chine (SVM), and stochastic gradient boosting tree (SGBT), -for navigational query identification in Web search. To boost -the performance of these machine techniques, we exploit sev- -eral feature selection methods and propose coupling feature -selection with classification approaches to achieve the best -performance. Different from most prior work that uses a -small number of features, in this paper, we study the prob- -lem of identifying navigational queries with thousands of -available features, extracted from major commercial search -engine results, Web search user click data, query log, and -the whole Web’s relational content. A multi-level feature -extraction system is constructed. -Our results on real search data show that 1) Among all -the features we tested, user click distribution features are the -most important set of features for identifying navigational -queries. 2) In order to achieve good performance, machine -learning approaches have to be coupled with good feature -selection methods. We find that gradient boosting tree, cou- -pled with linear SVM feature selection is most effective. 3) -With carefully coupled feature selection and classification -approaches, navigational queries can be accurately identi- -fied with 88.1% F1 score, which is 33% error rate reduction -compared to the best uncoupled system, and 40% error rate -reduction compared to a well tuned system without feature -selection. - - -Categories and Subject Descriptors - - -H.4 [Information Systems Applications]: Miscellaneous - - -*Dr. Peng contributes to this paper equally as Dr. Lu. 
-Permission to make digital or hard copies of all or part of this work for -personal or classroom use is granted without fee provided that copies are -not made or distributed for profit or commercial advantage and that copies -bear this notice and the full citation on the first page. To copy otherwise, to -republish, to post on servers or to redistribute to lists, requires prior specific -permission and/or a fee. - - -CIKM’06, November 5–11, 2006, Arlington, Virginia, USA. - - -Copyright 2006 ACM 1-59593-433-2/06/0011 ...$5.00. - - -General Terms -Experimentation -Keywords - - -Navigational Query Classification, Machine Learning - - -1. INTRODUCTION - - -Nowadays, Web search has become the main method for -information seeking. Users may have a variety of intents -while performing a search. For example, some users may -already have in mind the site they want to visit when they -type a query; they may not know the URL of the site or -may not want to type in the full URL and may rely on the -search engine to bring up the right site. Yet others may have -no idea of what sites to visit before seeing the results. The -information they are seeking normally exists on more than -one page. -Knowing the different intents associated with a query may -dramatically improve search quality. For example, if a query -is known to be navigational, we can improve search results -by developing a special ranking function for navigational -queries. The presentation of the search results or the user- -perceived relevance can also be improved by only showing -the top results and reserving the rest of space for other pur- -poses since users only care about the top result of a nav- -igational query. According to our statistics, about 18% of -queries in Web search are navigational (see Section 6). Thus, -correctly identifying navigational queries has a great poten- -tial to improve search performance. 
-Navigational query identification is not trivial due to a -lack of sufficient information in Web queries, which are nor- -mally short. Recently, navigational query identification, or -more broadly query classification, is drawing significant at- -tention. Many machine learning approaches that have been -used in general classification framework, including naive Bayes -classifier, maximum entropy models, support vector ma- -chines, and gradient boosting tree, can be directly applied -here. However, each of these approaches has its own advan- -tages that suit certain problems. Due to the characteristics -of navigational query identification (more to be addressed -in Section 2 ), it is not clear which one is the best for the -task of navigational query identification. Our first contri- -bution in this paper is to evaluate the effectiveness of these -machine learning approaches in the context of navigational -query identification. To our knowledge, this paper is the -very first attempt in this regard. - - -682 - - -Machine learning models often suffer from the curse of -feature dimensionality. Feature selection plays a key role -in many tasks, such as text categorization [18]. In this pa- -per, our second contribution is to evaluate several feature -selection methods and propose coupling feature selection -with classification approaches to achieve the best perfor- -mance: ranking features by using one algorithm before an- -other method is used to train the classifier. This approach is -especially useful when redundant low quality heterogeneous -features are encountered. -Most previous studies in query identification are based on -a small number of features that are obtained from limited -resources [12]. In this paper, our third contribution is to -explore thousands of available features, extracted from ma- -jor commercial search engine results, user Web search click -data, query log, and the whole Web’s relational content. 
To -obtain most useful features, we present a three level system -that integrates feature generation, feature integration, and -feature selection in a pipe line. -The system, after coupling features selected by SVM with -a linear kernel and stochastic gradient boosting tree as clas- -sification training method, is able to achieve an average per- -formance of 88.1% F1 score in a five fold cross-validation. -The rest of this paper is organized as follows. In the next -section, we will define the problem in more detail and de- -scribe the architecture of our system. We then present a -multi-level feature extraction system in Section 3. We de- -scribe four classification approaches in Section 4 and three -feature selection methods in Section 5. We then conduct -extensive experiments on real search data in Section 6. We -present detailed discussions in Section 7. We discuss some -related work in Section 8. Finally, we conclude the paper in -Section 9. - - -2. PROBLEM DEFINITION - - -We divide queries into two categories: navigational and -informational. According to the canonical definition [3, 14], -a query is navigational if a user already has a Web-site in -mind and the goal is simply to reach that particular site. -For example, if a user issues query “amazon”, he/she mainly -wants to visit “amazon.com”. This definition, however, is -rather subjective and not easy to formalize. In this paper, -we extend the definition of navigational query to a more -general case: a query is navigational if it has one and only -one perfect site in the result set corresponding to this query. -A site is considered as perfect if it contains complete infor- -mation about the query and lacks nothing essential. -In our definition, navigational query must have a corre- -sponding result page that conveys perfectness, uniqueness, -and authority. Unlike Broder’s definition, our definition -does not require the user to have a site in mind. This makes -data labeling more objective and practical. 
For example, when a user issues the query “Fulton, NY”, it is not clear
-whether the user knows the Web-site “www.fultoncountyny.org”. However, this
-Web-site has unique authority and perfect content for this query, and
-therefore the query “Fulton, NY” is labeled as a navigational query. All
-non-navigational queries are considered informational. For an informational
-query, there typically exist multiple excellent Web-sites corresponding to
-the query that users are willing to explore.
-To give another example, in our dataset, the query “national earth science
-teachers association” has only one perfect corresponding URL,
-“http://www.nestanet.org/”, and is therefore labeled as a navigational
-query. The query “Canadian gold maple leaf” has several excellent
-corresponding URLs, including
-“http://www.goldfingercoin.com/catalog gold/canadian maple leaf.htm”,
-“http://coins.about.com/library/weekly/aa091802a.htm” and
-“http://www.onlygold.com/Coins/CanadianMapleLeafsFullScreen.asp”.
-Therefore, the query “Canadian gold maple leaf” is labeled as a
-non-navigational query.
-Figure 1 illustrates the architecture of our navigational query
-identification system. A search engine takes in a query and returns a set of
-URLs. The query and the returned URLs are sent into a multi-level feature
-extraction system that generates and selects useful features; details are
-presented in the next section. Selected features are then input into a
-machine learning tool to learn a classification model.
-
-
-3. MULTI-LEVEL FEATURE EXTRACTION
-
-
-The multi-level feature system is one of the unique features of our system.
-Unlike prior work with a limited number of features or in a simulated
-environment [11, 12], our work is based on real search data, a major search
-engine’s user click information and a query log. In order to handle a large
-number of heterogeneous features efficiently, we propose a multi-level
-feature system.
The first level is the feature generation level, which calculates statistics
-and induces features from three resources: a click engine, a Web-map, and a
-query log. The second level integrates query-URL pairwise features into
-query features by applying various functions. The third level is a feature
-selection module, which ranks features using different methods. Below we
-present the details of the first two levels. The third level is presented
-separately in Section 5, since those feature selection methods are standard.
-
-[Figure 1: Diagram of Result Set Based Navigational Query Identification
-System. Components shown: search engine; feature generation (Webmap, click
-engine, query log) producing query-URL features; feature integration (min,
-max, entropy, ...) producing integrated features; feature selection module
-(information gain, SVM feature ranking, boosting feature selection);
-classification module (naive Bayes, MaxEnt, SVM, SGBT).]
-
-
-3.1 Feature Generation
-
-
-Queries are usually too short and lack sufficient context to be classified.
-Therefore, we have to generate more features from other resources. We use
-three resources to generate features: a click engine, a Web-map, and query
-logs. The click engine is a device that records and analyzes user click
-behavior. It is able to generate hundreds of features automatically based on
-user click-through distributions [16]. A Web-map can be considered a
-relational database that stores hundreds of induced features on page content,
-anchor text, and the hyperlink structure of webpages, including the inbound
-and outbound URLs, etc. Query logs provide bag-of-words features and various
-language model based features computed from all the queries issued by users
-over a period of time.
-The input to the feature generation module is a query-URL pair. For each
-query, the top 100 URLs are recorded and 100 query-URL pairs are generated.
-For each query-URL pair, we record a total of 197 features generated from the
-following four categories:
-* Click features: Click features record the click information about a URL.
-  We generate a total of 29 click features for each query-URL pair. An
-  example of a click feature is the click ratio (CR). Let $n_{ik}$ denote the
-  number of clicks on URL k for query i, and let the total number of clicks
-  for query i be
-
-      $n_i = \sum_k n_{ik}$.
-
-  The click ratio is the ratio of the number of clicks on a particular URL K
-  for query i to the total number of clicks for this query, which has the
-  form
-
-      $CR(i, K) = \frac{n_{iK}}{n_i}$.
-
-* URL features: URL features measure the characteristics of the URL itself.
-  There are 24 URL based features in total. One such feature is a URL match
-  feature, named urlmr, which is defined as
-
-      $urlmr = \frac{l(p)}{l(u)}$
-
-  where l(p) is the length of the longest substring p of the query that is
-  present in the URL and l(u) is the length of the URL u. This feature is
-  based on the observation that Web-sites tend to use their names in their
-  URLs. The distributions confer uniqueness and authority.
-* Anchor text features: Anchor text is the visible text in a hyperlink, which
-  also provides useful information for navigational query identification.
-  For example, one anchor text feature is the entropy of the anchor link
-  distribution [12]. This distribution is basically the histogram of inbound
-  anchor text of the destination URL. If a URL is pointed to by the same
-  anchor texts, the URL is likely to contain perfect content. There are many
-  other anchor text features that are calculated by considering many
-  factors, such as the edit distance between query and anchor texts, the
-  diversity of the hosts, etc. In total, there are 63 features derived from
-  anchor text.
-Since we record the top 100 results for each query and each query-URL pair
-has 197 features, in total there are 19,700 features available for each
-query. Feature reduction becomes necessary due to the curse of
-dimensionality [5]. Before applying feature selection, we conduct a feature
-integration procedure that merges redundant features.
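The two example features described in Section 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the click engine or Web-map used in the paper: the function names, the dictionary of click counts, and the lower-casing of query and URL are hypothetical, and urlmr is taken to be the ratio l(p)/l(u) implied by the definitions of l(p) and l(u) in the text.

```python
def click_ratio(clicks_for_query, url):
    """CR(i, K) = n_iK / n_i, where clicks_for_query maps URL -> click count
    for a single query i (hypothetical input format)."""
    total = sum(clicks_for_query.values())
    if total == 0:
        return 0.0
    return clicks_for_query.get(url, 0) / total


def longest_common_substring_len(query, url):
    """l(p): length of the longest substring of the query that occurs in the URL."""
    best = 0
    prev = [0] * (len(url) + 1)
    for qc in query:
        cur = [0] * (len(url) + 1)
        for j, uc in enumerate(url, start=1):
            if qc == uc:
                # Extend the common substring ending at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def urlmr(query, url):
    """URL match ratio l(p) / l(u); case-insensitive matching is an assumption."""
    if not url:
        return 0.0
    return longest_common_substring_len(query.lower(), url.lower()) / len(url)
```

For instance, with clicks `{"a": 3, "b": 1}` the click ratio of URL `"a"` is 3/4, and the query "Walmart" shares the 7-character substring "walmart" with the 15-character URL "www.walmart.com".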
-
-
-3.2 Feature Integration
-
-
-We design a feature integration operator, named the normalized ratio $r_k$ of
-rank k, as follows:
-
-    $r_k(f_j) = \frac{\max(f_j) - f_{jk}}{\max(f_j) - \min(f_j)}, \quad k = 2, 5, 10, 20.$    (1)
-
-The design of this operator is motivated by the observation that the values
-of query-URL features for navigational and informational queries decrease at
-different rates. Taking the urlmr feature as an example and considering a
-navigational query “Walmart” and an informational query “Canadian gold maple
-leaf”, we plot the feature values of the top 100 URLs for both queries, as
-shown in Figure 2. As we can see, the feature value for the navigational
-query drops quickly to a stable point, while that of the informational query
-is not stable. As we will see in the experiment section, this operator is
-most effective in feature reduction.
-
-[Figure 2: urlmr query-URL feature for a navigational query (upper panel,
-query “Walmart”) and an informational query (lower panel, query “Canadian
-gold maple leaf”); x-axis: rank, 0 to 100.]
-
-Besides this operator, we use other statistics for feature integration,
-including mean, median, maximum, minimum, entropy, standard deviation and
-the values in the top five positions of the result set query-URL pair
-features. In total, we now have 15 measurements instead of 100 for the top
-100 URLs for each query. Therefore, for each query, the dimension of a
-feature vector is m = 15 x 197 = 2955, which is much smaller than 19,700.
-
-
-4. CLASSIFICATION METHODS
-
-
-We apply the most popular generative (such as the naive Bayes method),
-descriptive (such as the maximum entropy method), and discriminative (such as
-support vector machine and stochastic gradient boosting tree) learning
-methods [19] to attack the problem.
-
-
-4.1 Naive Bayes Classifier
-
-
-A simple yet effective learning algorithm for classification is based on a
-simple application of Bayes’ rule
-
-    $P(y \mid q) = \frac{P(y) \times P(q \mid y)}{P(q)}$    (2)
-
-In query classification, a query q is represented by a vector of K
-attributes $q = (v_1, v_2, \ldots, v_K)$. Computing $P(q \mid y)$ in this case
-is not trivial, since the space of possible queries
-$q = (v_1, v_2, \ldots, v_K)$ is vast. To simplify this computation, the naive
-Bayes model introduces an additional assumption that all of the attribute
-values, $v_j$, are independent given the category label, y. That is, for
-$i \neq j$, $v_i$ and $v_j$ are conditionally independent given y. This
-assumption greatly simplifies the computation by reducing Eq. (2) to
-
-    $P(y \mid q) = \frac{P(y) \times \prod_{j=1}^{K} P(v_j \mid y)}{P(q)}$    (3)
-
-Based on Eq. (3), a maximum a posteriori (MAP) classifier can be constructed
-by seeking the optimal category which maximizes the posterior
-$P(y \mid q)$:
-
-    $y^* = \arg\max_{y \in Y} \left\{ P(y) \times \prod_{j=1}^{K} P(v_j \mid y) \right\}$    (4)
-
-    $\phantom{y^*} = \arg\max_{y} \left\{ \prod_{j=1}^{K} P(v_j \mid y) \right\}$    (5)
-
-Eq. (5) is called the maximum likelihood naive Bayes classifier, obtained by
-assuming a uniform prior over categories. To cope with features that remain
-unobserved during training, the estimate of $P(v_j \mid y)$ is usually
-adjusted by Laplace smoothing
-
-    $P(v_j \mid y) = \frac{N_{yj} + a_j}{N_y + a}$    (6)
-
-where $N_{yj}$ is the frequency of attribute j in $D_y$,
-$N_y = \sum_j N_{yj}$, and $a = \sum_j a_j$. A special case of Laplace
-smoothing is add-one smoothing, obtained by setting $a_j = 1$. We use add-one
-smoothing in our experiments below.
-
-
-4.2 Maximum Entropy Classifier
-
-
-Maximum entropy is a general technique for estimating probability
-distributions from data and has been successfully applied in many natural
-language processing tasks. The over-riding principle in maximum entropy is
-that when nothing is known, the distribution should be as uniform as
-possible, that is, have maximal entropy [9]. Labeled training data are used
-to derive a set of constraints for the model that characterize the
-class-specific expectations for the distribution. Constraints are
-represented as expected values of features. The improved iterative scaling
-algorithm finds the maximum entropy distribution that is consistent with the
-given constraints. In the query classification scenario, maximum entropy
-estimates the conditional distribution of the class label given a query. A
-query is represented by a set of features. The labeled training data are
-used to estimate the expected value of these features on a class-by-class
-basis. Improved iterative scaling finds a classifier of an exponential form
-that is consistent with the constraints from the labeled data.
-It can be shown that the maximum entropy distribution is always of the
-exponential form [4]:
-
-    $P(y \mid q) = \frac{1}{Z(q)} \exp\left( \sum_i \lambda_i f_i(q; y) \right)$
-
-where each $f_i(q; y)$ is a feature, $\lambda_i$ is a parameter to be
-estimated and Z(q) is simply the normalizing factor to ensure a proper
-probability: $Z(q) = \sum_y \exp\left( \sum_i \lambda_i f_i(q; y) \right)$.
-Learning of the parameters can be done using generalized iterative scaling
-(GIS), improved iterative scaling (IIS), or a quasi-Newton gradient-climber
-[13].
-
-
-4.3 Support Vector Machine
-
-
-Support Vector Machine (SVM) is one of the most successful discriminative
-learning methods. It seeks a hyperplane to separate a set of positively and
-negatively labeled training data. The hyperplane is defined by
-$w^T x + b = 0$, where the parameter $w \in R^m$ is a vector orthogonal to the
-hyperplane and $b \in R$ is the bias. The decision function is the hyperplane
-classifier
-
-    $H(x) = \mathrm{sign}(w^T x + b)$.
-
-The hyperplane is designed such that $y_i (w^T x_i + b) \geq 1 - \xi_i$,
-$\forall i = 1, \ldots, N$, where $x_i \in R^m$ is a training data point and
-$y_i \in \{+1, -1\}$ denotes the class of the vector $x_i$. The margin is
-defined by the distance between the two parallel hyperplanes
-$w^T x + b = 1$ and $w^T x + b = -1$, i.e. $2 / \|w\|_2$. The margin is related
-to the generalization of the classifier [17]. The SVM training problem is
-defined as follows:
-
-    minimize    $(1/2) w^T w + \gamma 1^T \xi$
-    subject to  $y_i (w^T x_i + b) \geq 1 - \xi_i, \quad i = 1, \ldots, N$    (7)
-                $\xi \geq 0$
-
-where the scalar $\gamma$ is called the regularization parameter, and is
-usually empirically selected to reduce the testing error rate.
-The basic SVM formulation can be extended to the nonlinear case by using
-nonlinear kernels. Interestingly, the complexity of an SVM classifier
-representation does not depend on the number of features, but rather on the
-number of support vectors (the training examples closest to the hyperplane).
-This property makes SVMs suitable for high dimensional classification
-problems [10]. In our experimentation, we use a linear SVM and an SVM with a
-radial basis kernel.
-
-
-4.4 Gradient Boosting Tree
-
-
-Like SVM, the gradient boosting tree model also seeks a parameterized
-classifier. It iteratively fits an additive model [8]
-
-    $f_T(x) = T_0(x; \Theta_0) + \lambda \sum_{t=1}^{T} \beta_t T_t(x; \Theta_t)$,
-
-such that a certain loss function $L(y_i, f_T(x_i))$ is minimized, where
-$T_t(x; \Theta_t)$ is a tree at iteration t, weighted by parameter
-$\beta_t$, with a finite number of parameters $\Theta_t$, and $\lambda$ is
-the learning rate. At iteration t, tree $T_t(x; \beta)$ is induced to fit the
-negative gradient by least squares. That is
-
-    $\hat{\Theta} := \arg\min_{\beta} \sum_{i}^{N} \left( -G_{it} - \beta_t T_t(x_i; \Theta) \right)^2$,
-
-where $G_{it}$ is the gradient over the current prediction function
-
-    $G_{it} = \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f = f_{t-1}}$.
-
-The optimal weights of trees $\beta_t$ are determined by
-
-    $\beta_t = \arg\min_{\beta} \sum_{i}^{N} L(y_i, f_{t-1}(x_i) + \beta T(x_i, \Theta))$.
-
-If the L-2 loss function $[y_i - f(x_i)]^2 / 2$ is used, we have the gradient
-$G(x_i) = -y_i + f(x_i)$. In this paper, the Bernoulli loss function
-
-    $-2 \sum_i \left( y_i f(x_i) - \log(1 + \exp(f(x_i))) \right)$
-
-is used and the gradient has the form
-
-    $G(x_i) = y_i - \frac{1}{1 + \exp(-f(x_i))}$.
-
-During each iteration of gradient boosting, the feature space is further
-partitioned. This kind of rectangular partition does not require any data
-preprocessing and the resulting classifier can be very robust. However, it
-may suffer from the dead zoom phenomenon, where the prediction is not able to
-change with features, due to its discrete feature space partition. Friedman
-(2002) found that performance improves when the dataset is sampled uniformly
-without replacement before estimating the next gradient step [6]. This
-method was called stochastic gradient boosting.
-
-
-5. FEATURE SELECTION
-
-
-Many methods have been used in feature selection for text classification,
-including information gain, mutual information, document frequency
-thresholding, and Chi-square statistics. Yang and Pedersen [18] give a good
-comparison of these methods. Information gain is one of the most effective
-methods in the context of text categorization. In addition to information
-gain, we also use feature selection methods based on SVM’s feature
-coefficients and stochastic gradient boosting tree’s variable importance.
-
-
-5.1 Information Gain
-
-
-Information gain is frequently used as a measure of feature goodness in text
-classification [18]. It measures the number of bits of information obtained
-for category prediction by knowing the presence or absence of a feature. Let
-$y_i : i = 1..m$ be the set of categories. The information gain of a feature f
-is defined as
-
-    $IG(f) = -\sum_{i=1}^{m} P(y_i) \log P(y_i)$
-    $\qquad + P(f) \sum_{i=1}^{m} P(y_i \mid f) \log P(y_i \mid f)$
-    $\qquad + P(\bar{f}) \sum_{i=1}^{m} P(y_i \mid \bar{f}) \log P(y_i \mid \bar{f})$
-
-where $\bar{f}$ indicates that f is not present. We compute the information
-gain for each unique feature and select the top ranked features.
-
-
-5.2 Linear SVM Feature Ranking
-
-
-Linear SVM (7) produces a hyperplane as well as a normal vector w. The
-normal vector w serves as the slope of the hyperplane classifier and
-measures the relative importance that each feature contributes to the
-classifier. An extreme case is that when there is only one feature
-correlated to the sample labels, the optimal classifier hyperplane must be
-perpendicular to this feature axis.
-The L-2 norm of w, in the objective, denotes the inverse margin. Also, it
-can be viewed as a Gaussian prior of the random variable w. Sparse results
-may be achieved by assuming a Laplace prior and using the L-1 norm [2].
-Unlike the previous information gain method, the linear SVM normal vector w
-is not determined by the whole body of training samples. Instead, it is
-determined by an optimally determined subset, the support vectors, that are
-critical to be classified. Another difference is obvious: the normal vector w
-is solved jointly by all features instead of one by one independently.
-Our results show that linear SVM is able to provide reasonably good results
-in feature ranking for our navigational query identification problem even
-when the corresponding classifier is weak.
-
-
-5.3 Stochastic Gradient Boosting Tree
-
-
-Boosting methods construct weak classifiers using subsets of features and
-combine them by considering their prediction errors. It is a natural
-feature ranking procedure: each feature is ranked by its related
-classification errors.
-Tree based boosting methods approximate the relative influence of a feature
-$x_j$ as
-
-    $J_j^2 = \sum_{\text{splits on } x_j} I_k^2$
-
-where $I_k^2$ is the empirical improvement by the k-th splitting on $x_j$ at
-that point.
-Unlike the information gain model that considers one feature at a time or
-the SVM method that considers all the features at one time, the boosting
-tree model considers a set of features at a time and combines them according
-to their empirical errors.
-Let R(X) be a feature ranking function based on data set X. Information gain
-feature ranking depends on the whole training set:
-$R_{Info}(X) = R_{Info}(X_{tr})$. Linear SVM ranks features based on an
-optimally determined subset of the data.
That is, $R_{SVM}(X) = R_{SVM}(X_{SV})$, where $X_{SV}$ is the set of support
-vectors. The stochastic gradient boosting tree (SGBT) uses multiple randomly
-sampled data sets to induce trees and ranks features by their linear
-combination. Its ranking function can be written as
-$R_{SGBT}(X) = \sum_{t=1}^{T} \beta_t R_{SGBT}^{t}(X_t)$, where $X_t$ is the
-training set randomly sampled at iteration t.
-
-
-6. EXPERIMENTS
-
-
-6.1 Data Set
-
-
-A total of 2102 queries were uniformly sampled from a query log over a four
-month period. The queries were sent to four major search engines: Yahoo,
-Google, MSN, and Ask. The top 5 URLs returned by each search engine were
-recorded and sent to trained editors for labeling (the number 5 is just an
-arbitrary number we found good enough to measure the quality of retrieval).
-If there exists one and only one perfect URL among all returned URLs for a
-query, this query is labeled as a navigational query. Otherwise, it is
-labeled as a non-navigational query.
-Out of 2102 queries, 384 queries are labeled as navigational. Since they are
-uniformly sampled from a query log, we estimate that about 18% of queries are
-navigational.
-The data set was divided into five folds for the purpose of
-cross-validation. All results presented in this section are average testing
-results in five-fold cross-validation.
-
-
-6.2 Evaluation
-
-
-Classification performance is evaluated using three metrics: precision,
-recall and F1 score. In each test, let $n_{++}$ denote the number of positive
-samples that are correctly classified (true positives); $n_{-+}$ denote the
-number of negative samples that are classified as positive (false
-positives); $n_{+-}$ denote the number of positive samples that are
-classified as negative (false negatives); and $n_{--}$ denote the number of
-negative samples that are correctly classified (true negatives). Recall is
-the ratio of the number of true positives to the total number of positive
-samples in the testing set, namely
-
-    $recall = \frac{n_{++}}{n_{++} + n_{+-}}$.
-
-Precision is the ratio of the number of true positive samples to the number
-of samples that are classified as positive, namely
-
-    $precision = \frac{n_{++}}{n_{++} + n_{-+}}$.
-
-F1 is a single score that combines precision and recall, defined as follows:
-
-    $F1 = \frac{2 \times precision \times recall}{precision + recall}$.
-
-
-6.3 Results
-6.3.1 Feature Selection Results
-
-
-Table 1 shows the distributions of the top 50 features selected by different
-methods. All methods agree that click features are the most important. In
-particular, linear SVM and boosting tree select more click features than
-information gain. On the other hand, information gain selects many features
-from anchor text and other metrics such as spam scores.
-
-
-Table 1: Distributions of the Selected Top 50 Features According to Feature
-Categories
-
-Feature Set      Info. Gain   Linear SVM   Boosting
-Click            52%          84%          74%
-URL              4%           2%           6%
-Anchor Text      18%          2%           12%
-Other metrics    26%          12%          8%
-
-
-Table 2 shows the distribution of the selected features according to feature
-integration operators. It shows which operators applied to result set
-query-URL pairwise features are most useful. We group the 15 operators into
-5 types: vector, normalized ratios ($r_k$, k = 2, 5, 10, 20), min/max,
-entropy/standard deviation, and median/mean. The vector group includes all
-query-URL pair features in the top 5 positions; normalized ratios are
-defined in (1). As we can see from the table, all feature integration
-operators are useful.
-
-
-Table 2: Distributions of the Selected Top 50 Features According to
-Integration Operators
-
-Operators           Info. Gain   Linear SVM   Boosting
-vector              40%          22%          28%
-normalized ratios   8%           38%          22%
-min/max             6%           20%          16%
-entropy/std         20%          16%          18%
-mean/median         26%          4%           16%
-
-
-The number of selected features directly influences the classification
-performance. Figure 3 shows the relationship between the boosting tree
-classification performance and the number of selected features. As we can
-see, performance increases with cleaner selected features. However, if the
-number of selected features is too small, performance will decrease. A
-number of 50 works best in our work.
-
-
-6.3.2 Classification Results
-
-
-We first apply four different classification methods: naive Bayes, maximum
-entropy, support vector machine and stochastic gradient boosting tree, over
-all available features. The results are reported in Table 3. As we can see,
-stochastic gradient boosting tree has the best performance with an F1 score
-of 0.78.
-We then apply those methods to machine selected features. We test 4
-different feature sets with 50 features each, selected by information gain,
-linear SVM and boosting tree. The combined set consists of the 30 top
-features selected by linear SVM and the 29 top features selected by boosting
-tree. Please note that the total number of features is still 50, since
-linear SVM and boosting tree selected 9 identical features in their top 30
-feature sets.
-
-
-[Figure 3: Classification performance F1 against number of features: 25, 50,
-100, 200, 400, 800, and 2955 (all features). Plot title: Classification
-Performance VS Number of Features; x-axis: Number of Features Selected By
-Boosting Tree.]
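The size of the combined set follows from simple set arithmetic: 30 + 29 - 9 = 50. A minimal sketch of such a merge is given below; the function name and the toy feature names are hypothetical, and the real rankings come from the linear SVM and boosting tree selectors described in Section 5.

```python
def combine_top_features(ranking_a, ranking_b, k_a, k_b):
    """Union of the top k_a features of ranking_a and the top k_b features of
    ranking_b, keeping first-seen order and dropping duplicates."""
    combined = list(ranking_a[:k_a])
    seen = set(combined)
    for name in ranking_b[:k_b]:
        if name not in seen:
            seen.add(name)
            combined.append(name)
    return combined


# Toy rankings (made-up feature names) that share exactly 9 features.
svm_ranking = [f"f{i}" for i in range(30)]         # f0 .. f29
boost_ranking = [f"f{i}" for i in range(21, 50)]   # f21 .. f49 (29 features)
combined = combine_top_features(svm_ranking, boost_ranking, 30, 29)
```

With these toy rankings the merged list holds 50 distinct features, matching the count stated in the text.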
-
-Table 3: Results of Various Classification Methods over All Features
-
-Methods               Recall   Precision   F1
-Naive Bayes           0.242    0.706       0.360
-SVM (Linear Kernel)   0.189    1.000       0.318
-Maximum Entropy       0.743    0.682       0.711
-SVM (RBF Kernel)      0.589    0.485       0.528
-Boosting Trees        0.724    0.845       0.780
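The metrics of Section 6.2 can be computed directly from the confusion counts. The sketch below is illustrative only; the function names are hypothetical and the zero-denominator convention (returning 0.0) is an assumption, not something the paper specifies.

```python
def precision_recall_f1(n_pp, n_np, n_pn):
    """n_pp: true positives (n++), n_np: false positives (n-+),
    n_pn: false negatives (n+-). Returns (precision, recall, F1)."""
    precision = n_pp / (n_pp + n_np) if (n_pp + n_np) else 0.0
    recall = n_pp / (n_pp + n_pn) if (n_pp + n_pn) else 0.0
    return precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Plugging in the naive Bayes row of Table 3 (precision 0.706, recall 0.242) gives an F1 of about 0.360. Because the table reports averages over five folds, recomputing F1 from the averaged precision and recall need not reproduce every row exactly.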
- -Table 4 presents the results of the coupled feature selec- -tion and classification methods. It is obvious that the perfor- -mance of each method is improved by applying them to ma- -chine selected clean features, except naive Bayes classifier. -Surprisingly, the features selected by linear SVM are the -best set of features. The results show that even if the under- -lying problem is not linear separable, the linear coefficients -of the large margin linear classifier still convey important -feature information. When the stochastic gradient boost- -ing tree is applied over this set of features, we get the best -performance with 0.881 F1 score among all cross-methods -evaluations. Without feature ablation, SGBT is only able -to achieve 0.738 F1 score. That is, feature selection has -an effect of error reduction rate 40%. Without introducing -linear SVM in feature ablation, if SGBT works on the fea- -ture set selected by its own variable importance ranking, it -achieves 0.848 F1 score. That is to say, a cross methods -coupling of feature selection and classification causes a 33% -error reduction. - - -7. DISCUSSION - - -An interesting result from Table 1 is the features selected -for navigational query identification. Those features are -mostly induced from user click information. This is intu- -itively understandable because if a query is navigational, -the navigational URL is the most clicked one. On the other -hand, it might be risky to completely rely on click infor- -mation. The reasons might be 1) user click features may -be easier to be spammed, and 2) clicks are often biased by -various presentation situation such as quality of auto ab- -straction, etc. -From Table 4, we observe that linear SVM and boosting -tree have better feature selection power than information -gain. 
The reason that information gain performs inferior to -linear SVM and boosting tree is probably due to the fact -that information gain considers each feature independently -while linear SVM considers all features jointly and boosting -tree composites feature rank by sum over all used features. -The results show that URL, anchor text and other metrics -are helpful only when they are considered jointly with click -features. -The most important result is that the stochastic gradi- -ent boosting tree coupled with linear SVM feature selection -method achieves much better results than any other combi- -nation. In this application, the data has very high dimension -considering the small sample size. The boosting tree method -needs to partition an ultra-high dimensional feature space -for feature selection. However, the stochastic step does not -have enough data to sample from [6]. Therefore, the boosted -result might be biased by earlier sampling and trapped in -a local optimum. Support vector machine, however, is able -to find an optimally determined subset of training samples, -namely support vectors, and ranks features based on those -vectors. Therefore, the SVM feature selection step makes -up the disadvantage of the stochastic boosting tree in its -initial sampling and learning stages that may lead to a local -optimum. -As expected, naive Bayes classifier hardly works for the -navigational query identification problem. It is also the only -classifier that performs worse with feature selection. Naive -Bayes classifiers work well when the selected features are -mostly orthogonal. However, in this problem, all features -are highly correlated. On the other hand, classification -methods such as boosting tree, maximum entropy model -and SVM do not require orthogonal features. - - -8. RELATED WORK - - -Our work is closely related to query classification, a task of -assigning a query to one or more categories. 
However, general query classification and navigational query identification
-are different problems. Query classification focuses on content
-classification, thus the classes are mainly topic based, such as shopping
-and products, while in navigational query identification the two classes are
-intent based.
-With regard to classification approaches, our work is related to Gravano, et
-al. [7], where the authors applied various classification methods, including
-linear and nonlinear SVM, decision tree and log-linear regression, to
-classify query locality based on result set features in 2003. Their work,
-however, lacked carefully designed feature engineering and therefore only
-achieved an F1 score of 0.52 with a linear SVM. Beitzel, et al. [1] realized
-the limitation of a single classification method in their query
-classification problem and proposed a semi-supervised learning method. Their
-idea is to compose the final classifier by combining the classification
-results of multiple classification methods. Shen, et al. [15] also trained a
-linear combination of two classifiers. In contrast, instead of combining two
-classifiers for prediction, we couple feature selection and classification.
-With regard to feature extraction, our work is related to Kang and Kim 2003
-[11], where the authors extracted heterogeneous features to classify user
-queries into three categories: the topic relevance task, the homepage
-finding task and the service finding task. They combined those features, for
-example URL features and content features, by several empirical linear
-functions. Each function was applied to a different binary classification
-problem. Their idea was to emphasize features for different classification
-purposes. However, the important features were not selected automatically
-and therefore their work is not applicable in applications with thousands of
-features.
-
-
-Table 4: F1 Scores of Systems with Coupled Feature Selection and
-Classification Methods
-
-Methods               Info. Gain   Linear SVM   Boosting   Combined Set
-SVM (Linear Kernel)   0.124        0.733        0.712      0.738
-Naive Bayes           0.226        0.182        0.088      0.154
-Maximum Entropy       0.427        0.777        0.828      0.784
-SVM (RBF Kernel)      0.467        0.753        0.728      0.736
-Boosting Tree         0.627        0.881        0.848      0.834
-
-
-9. CONCLUSION
-
-
-We have made three contributions in the paper. First, we evaluate the
-effectiveness of four machine learning approaches in the context of
-navigational query identification. We find that boosting trees are the most
-effective. Second, we evaluate three feature selection methods and propose
-coupling feature selection with classification approaches. Third, we propose
-a multi-level feature extraction system to exploit more information for
-navigational query identification.
-The underlying classification problem has been satisfactorily solved with an
-88.1% F1 score. In addition to the successful classification, we
-successfully identified key features for recognizing navigational queries:
-the user click features. Other features, such as URL, anchor text, etc., are
-also important if coupled with user click features.
-In future research, it is of interest to conduct cross-method co-training
-for the query classification problem to utilize unlabeled data, as there is
-enough evidence that different training methods may benefit each other.
-
-
-10. REFERENCES
-
-
-[1] S. Beitzel, E. Jensen, D. Lewis, A. Chowdhury, A. Kolcz, and O. Frieder.
-Improving Automatic Query Classification via Semi-supervised Learning. In
-The Fifth IEEE International Conference on Data Mining, pages 27–30, New
-Orleans, Louisiana, November 2005.
-[2] C. Bhattacharyya, L. R. Grate, M. I. Jordan, L. El Ghaoui, and I. S.
-Mian. Robust Sparse Hyperplane Classifiers: Application to Uncertain
-Molecular Profiling Data. Journal of Computational Biology,
-11(6):1073–1089, 2004.
-[3] A. Broder. A Taxonomy of Web Search.
In ACM -SIGIR Forum, pages 3–10, 2002. -[4] S. della Pietra, V. della Pietra, and J. Lafferty. -Inducing Features of Random Fields. IEEE -Transactions on Pattern Analysis and Machine -Intelligence, 19(4), 1995. -[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern -Classification. John Wiley, New York, NY, 2nd -edition, 2000. -[6] J. H. Friedman. Stochastic Gradient Boosting. -Computational Statistics and Data Analysis, -38(4):367–378, 2002. -[7] L. Gravano, V. Hatzivassiloglou, and R. Lichtenstein. -Categorizing Web Queries According to Geographical -Locality. In ACM 12th Conference on Information -and Knowledge Management (CIKM), pages 27–30, -New Orleans, Louisiana, November 2003. -[8] T. Hastie, R. Tibshirani, and J. Friedman. The -Elements of Statistical Learning: Data Mining, -Inference, and Predication. Springer Verlag, New -York, 2001. -[9] E. T. Jaynes. Papers on Probability, Statistics, and -Statistical Physics. D. Reidel, Dordrecht, Holland and -Boston and Hingham, MA, 1983. -[10] T. Joachims. Text Categorization with Support Vector -Machines: Learning with Many Relevant Features. In -Proceedings of the 10th European Conference on -Machine Learning (ECML), pages 137–142, Chemnitz, -Germany, 1998. -[11] I.-H. Kang and G. Kim. Query Type Classification for -Web Document Retrieval. In Proceedings of the 26th -annual international ACM SIGIR conference on -Research and development in informaion retrieval, -pages 64 – 71, Toronto Canada, July 2003. -[12] U. Lee, Z. Liu, and J. Cho. Automatic Identification -of User Goals in Web Search. In Proceedings of the -14th International World Wide Web Conference -(WWW), Chiba, Japan, 2005. -[13] R. Malouf. A Comparison of Algorithms for Maximum -Entropy Parameter Estimation. In Proceedings of the -Sixth Conference on Natural Language Learning -(CoNLL), Taipei, China, 2002. -[14] D. E. Rose and D. Levinson. Understanding User -Goals in Web Search. 
In Proceedings of The 13th -International World Wide Web Conference (WWW), -2004. -[15] D. Shen, R. Pan, J.-T. Sun, J. J. Pan, K. Wu, J. Yin, -and Q. Yang. Q2C at UST: Our Winning Solution to -Query Classification in KDDCUP 2005. SIGKDD -Explorations, 7(2):100–110, 2005. -[16] L. Sherman and J. Deighton. Banner advertising: -Measuring effectiveness and optimizing placement. -Journal of Interactive Marketing, 15(2):60–64, 2001. -[17] V. Vapnik. The Nature of Statistical Learning Theory. -Springer Verlag, New York, 1995. -[18] Y. Yang and J. Pedersen. An Comparison Study on -Feature Selection in Text Categorization. In -Proceedings of the 20th annual international ACM -SIGIR conference on Research and development in -informaion retrieval, Philadelphia, PA, USA, 1997. -[19] S.C. Zhu. Statistical modeling and conceptualization -of visual patterns. IEEE Transactions on Pattern -Analysis and Machine Intelligence, 25(6):619–712, -2003. - - -689 - -
-
- - -Coupling Feature Selection and Machine Learning Methods for Navigational Query Identification -Yumao Lu Fuchun Peng Xin Li Nawaaz Ahmed -Yahoo! Inc -
701 First Avenue Sunnyvale, California 94089
-fyumaol,fuchun,xinli,nawaazj@yahoo-inc.com -It is important yet hard to identify navigational queries in Web search due to a lack of sufficient information in Web queries, which are typically very short. In this paper we study several machine learning methods, including naive Bayes model, maximum entropy model, support vector machine (SVM), and stochastic gradient boosting tree (SGBT), for navigational query identification in Web search. To boost the performance of these machine techniques, we exploit several feature selection methods and propose coupling feature selection with classification approaches to achieve the best performance. Different from most prior work that uses a small number of features, in this paper, we study the problem of identifying navigational queries with thousands of available features, extracted from major commercial search engine results, Web search user click data, query log, and the whole Web’s relational content. A multi-level feature extraction system is constructed. Our results on real search data show that 1) Among all the features we tested, user click distribution features are the most important set of features for identifying navigational queries. 2) In order to achieve good performance, machine learning approaches have to be coupled with good feature selection methods. We find that gradient boosting tree, coupled with linear SVM feature selection is most effective. 3) With carefully coupled feature selection and classification approaches, navigational queries can be accurately identified with 88.1% F1 score, which is 33% error rate reduction compared to the best uncoupled system, and 40% error rate reduction compared to a well tuned system without feature selection -Categories and Subject Descriptors H.4 [Information Systems Applications]: Miscellaneous -Dr. Peng contributes to this paper equally as Dr. Lu. 
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are -not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee -
CIKM’06, November 5–11, 2006, Arlington, Virginia, USA
-Copyright 2006 ACM 1-59593-433-2/06/0011 ...$5.00 -General Terms Experimentation -Keywords -Navigational Query Classification, Machine Learning -
-
- - - - -S Beitzel -E Jensen -D Lewis -A Chowdhury -A Kolcz -O Frieder - -Improving Automatic Query Classification via Semi-supervised Learning -2005 -In The Fifth IEEE International Conference on Data Mining -27--30 -New Orleans, Louisiana - -uery locality based on result set features in 2003. Their work, however, lacked carefully designed feature engineering and therefore only achieved a F1 score of 0.52 with a linear SVM. Beitzel, et al.[1] realized the limitation of a single classification method in their query classification problem and proposed a semi-supervised learning method. Their idea is to compose the final classifier by combin - -[1] -S. Beitzel, E. Jensen, D. Lewis, A. Chowdhury, A. Kolcz, and O. Frieder. Improving Automatic Query Classification via Semi-supervised Learning. In The Fifth IEEE International Conference on Data Mining, pages 27–30, New Orleans, Louisiana, November 2005. - - - -C Bhattacharyya -L R Grate -M I Jordan -L El Ghaoui -I S Mian - -Robust Sparse Hyperplane Classifiers: Application to Uncertain Molecular Profiling Data -2004 -Journal of Computational Biology -11 - - of w, in the objective, denotes the inverse margin. Also, it can be viewed as a Gaussian prior of random variable w. Sparse results may be achieved by assuming a laplace prior and using the L-1 norm [2]. Unlike the previous information gain method, the linear SVM normal vector w is not determined by the whole body of training samples. Instead, it is determined by an optimally determined subset, supp - -[2] -C. Bhattacharyya, L. R. Grate, M. I. Jordan, L. El Ghaoui, and I. S. Mian. Robust Sparse Hyperplane Classifiers: Application to Uncertain Molecular Profiling Data. Journal of Computational Biology, 11(6):1073–1089, 2004. - - - -A Broder - -A Taxonomy of Web Search -2002 -In ACM SIGIR Forum -3--10 - -ated work in Section 8. Finally, we conclude the paper in Section 9. 2. PROBLEM DEFINITION We divide queries into two categories: navigational and informational. 
According to the canonical definition [3, 14], a query is navigational if a user already has a Web-site in mind and the goal is simply to reach that particular site. For example, if a user issues query “amazon”, he/she mainly wants to visit “ama - -[3] -A. Broder. A Taxonomy of Web Search. In ACM SIGIR Forum, pages 3–10, 2002. - - - -S della Pietra -V della Pietra -J Lafferty - -Inducing Features of Random Fields -1995 -IEEE Transactions on Pattern Analysis and Machine Intelligence -19 - -caling finds a classifier of an exponential form that is consistent with the constraints from the labeled data. It can be shown that the maximum entropy distribution is always of the exponential form [4]: where each fi (q; y) is a feature, λi is a parameter to be estimated and Z(q) is simply the normalizing factor to ensure a proper probability: Z(q) = Ey exp(Ei λi f i(q; y)). Learning of the paramet - -[4] -S. della Pietra, V. della Pietra, and J. Lafferty. Inducing Features of Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 1995. - - - -R O Duda -P E Hart -D G Stork - -Pattern Classification -2000 -John Wiley -New York, NY, 2nd edition - -op 100 results for each query and each query URL pair has 197 features, in total there are 19,700 features available for each query. Feature reduction becomes necessary due to curse of dimensionality [5]. Before applying feature selection, we conduct a feature integration procedure that merges redundant features. 3.2 Feature Integration We design a feature integration operator, named normalized ratio - -[5] -R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley, New York, NY, 2nd edition, 2000. - - - -J H Friedman - -Stochastic Gradient Boosting -2002 -Computational Statistics and Data Analysis -38 - -tures, due to its discrete feature space partition. 
Friedman (2002) found that it helps performance by sampling uniformly without replacement from the dataset before estimating the next gradient step [6]. This method was called stochastic gradient boosting. 5. FEATURE SELECTION Many methods have been used in feature selection for text classification, including information gain, mutual information, do -the small sample size. The boosting tree method needs to partition an ultra-high dimensional feature space for feature selection. However, the stochastic step does not have enough data to sample from [6]. Therefore, the boosted result might be biased by earlier sampling and trapped in a local optimum. Support vector machine, however, is able to find an optimally determined subset of training samples, - -[6] -J. H. Friedman. Stochastic Gradient Boosting. Computational Statistics and Data Analysis, 38(4):367–378, 2002. - - - -L Gravano -V Hatzivassiloglou -R Lichtenstein - -Categorizing Web Queries According to Geographical Locality -2003 -In ACM 12th Conference on Information and Knowledge Management (CIKM) -27--30 -New Orleans, Louisiana - -pic based, such as shopping and products. While in navigational query identification, the two classes are intent based. In the classification approaches regard, our work is related to Gravano, et al. [7] where authors applied various classification methods, including linear and nonlinear SVM, decision tree and log-linear regression to classify query locality based on result set features in 2003. Thei - -[7] -L. Gravano, V. Hatzivassiloglou, and R. Lichtenstein. Categorizing Web Queries According to Geographical Locality. In ACM 12th Conference on Information and Knowledge Management (CIKM), pages 27–30, New Orleans, Louisiana, November 2003. - - - -T Hastie -R Tibshirani -J Friedman - -The Elements of Statistical Learning: Data Mining, Inference, and Predication -2001 -Springer Verlag -New York - - we use a linear SVM and a SVM with radial basis kernel. 
4.4 Gradient Boosting Tree Like SVM, gradient boosting tree model also seeks a parameterized classifier. It iteratively fits an additive model [8] T ft(x) = Tt(x; Θ0) + λ X t=1 such that certain loss function L(yi, fT(x + i) is minimized, where Tt(x; Θt) is a tree at iteration t, weighted by parameter βt, with a finite number of parameters, Θt - -[8] -T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Predication. Springer Verlag, New York, 2001. - - - -E T Jaynes - -Papers on Probability, Statistics, and Statistical Physics -1983 -D. Reidel, Dordrecht, Holland - -in many natural language processing tasks. The over-riding principle in maximum entropy is that when nothing is known, the distribution should be as uniform as possible, that is, have maximal entropy [9]. Labeled training data are used to derive a set of constraints for the model that characterize the class-specific expectations for the distribution. Constraints are represented as expected values of - -[9] -E. T. Jaynes. Papers on Probability, Statistics, and Statistical Physics. D. Reidel, Dordrecht, Holland and Boston and Hingham, MA, 1983. - - - -T Joachims - -Text Categorization with Support Vector Machines: Learning with Many Relevant Features -1998 -In Proceedings of the 10th European Conference on Machine Learning (ECML) -137--142 -Chemnitz, Germany - -n the number of features, but rather on the number of support vectors (the training examples closest to the hyperplane). This property makes SVMs suitable for high dimensional classification problems [10]. In our experimentation, we use a linear SVM and a SVM with radial basis kernel. 4.4 Gradient Boosting Tree Like SVM, gradient boosting tree model also seeks a parameterized classifier. It iterativel - -[10] -T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. 
In Proceedings of the 10th European Conference on Machine Learning (ECML), pages 137–142, Chemnitz, Germany, 1998. - - - -I-H Kang -G Kim - -Query Type Classification for Web Document Retrieval -2003 -In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 64 – 71 -Toronto Canada - -del. 3. MULTI-LEVEL FEATURE EXTRACTION The multiple level feature system is one of the unique features of our system. Unlike prior work with a limited number of features or in a simulated environment [11, 12], our work is based on real search data, a major search engine’s user click information and a query log. In order to handle large amount of heteorgeneous features in an efficient way, we propose a mul -assifiers. Differently, instead of combining two classifiers for prediction, we couple feature selection and classification. In the feature extraction aspect, our work is related to Kang and Kim 2003 [11] where authors extracted heterogenous features to classify user queries into three categories: topic relevance task, the homepage finding task and service finding task. They combined those features, f - -[11] -I.-H. Kang and G. Kim. Query Type Classification for Web Document Retrieval. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 64 – 71, Toronto Canada, July 2003. - - - -U Lee -Z Liu -J Cho - -Automatic Identification of User Goals in Web Search -2005 -In Proceedings of the 14th International World Wide Web Conference (WWW) -Chiba, Japan - -seful when redundant low quality heterogeneous features are encountered. Most previous studies in query identification are based on a small number of features that are obtained from limited resources [12]. 
In this paper, our third contribution is to explore thousands of available features, extracted from major commercial search engine results, user Web search click data, query log, and the whole Web’s -del. 3. MULTI-LEVEL FEATURE EXTRACTION The multiple level feature system is one of the unique features of our system. Unlike prior work with a limited number of features or in a simulated environment [11, 12], our work is based on real search data, a major search engine’s user click information and a query log. In order to handle large amount of heteorgeneous features in an efficient way, we propose a mul - text is the visible text in a hyperlink, which also provides useful information for navigational query identification. For example, one anchor text feature is the entropy of anchor link distribution [12]. This distribution is basically the histogram of inbound anchor text of the destination URL. If an URL is pointed to by the same anchor texts, the URL is likely to contain perfect content. There are - -[12] -U. Lee, Z. Liu, and J. Cho. Automatic Identification of User Goals in Web Search. In Proceedings of the 14th International World Wide Web Conference (WWW), Chiba, Japan, 2005. - - - -R Malouf - -A Comparison of Algorithms for Maximum Entropy Parameter Estimation -2002 -In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL) -Taipei, China - -roper probability: Z(q) = Ey exp(Ei λi f i(q; y)). Learning of the parameters can be done using generalized iterative scaling (GIS), improved iterative scaling (IIS), or quasi-Newton gradient-climber [13]. 4.3 Support Vector Machine Support Vector Machine (SVM) is one of the most successful discriminative learning methods. It seeks a hyperplane to separate a set of positively and negatively labeled tr - -[13] -R. Malouf. A Comparison of Algorithms for Maximum Entropy Parameter Estimation. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL), Taipei, China, 2002. 
- - - -D E Rose -D Levinson - -Understanding User Goals in Web Search -2004 -In Proceedings of The 13th International World Wide Web Conference (WWW) - -ated work in Section 8. Finally, we conclude the paper in Section 9. 2. PROBLEM DEFINITION We divide queries into two categories: navigational and informational. According to the canonical definition [3, 14], a query is navigational if a user already has a Web-site in mind and the goal is simply to reach that particular site. For example, if a user issues query “amazon”, he/she mainly wants to visit “ama - -[14] -D. E. Rose and D. Levinson. Understanding User Goals in Web Search. In Proceedings of The 13th International World Wide Web Conference (WWW), 2004. - - - -D Shen -R Pan -J-T Sun -J J Pan -K Wu -J Yin -Q Yang - -Q2C at UST: Our Winning Solution to Query Classification in KDDCUP -2005 -SIGKDD Explorations -7 - -assification problem and proposed a semi-supervised learning method. Their idea is to compose the final classifier by combining classification results of multiple classification methods. Shen, et al. [15] also trained a linear combination of two classifiers. Differently, instead of combining two classifiers for prediction, we couple feature selection and classification. In the feature extraction aspec - -[15] -D. Shen, R. Pan, J.-T. Sun, J. J. Pan, K. Wu, J. Yin, and Q. Yang. Q2C at UST: Our Winning Solution to Query Classification in KDDCUP 2005. SIGKDD Explorations, 7(2):100–110, 2005. - - - -L Sherman -J Deighton - -Banner advertising: Measuring effectiveness and optimizing placement -2001 -Journal of Interactive Marketing -15 - - a Web-map, and query logs. The click engine is a device to record and analyze user click behavior. It is able to generate hundreds of features automatically based on user click through distributions [16]. 
A Web-map can be considered as a relational database that stores hundreds of induced features on page content, anchor text, hyperlink structure of webpages, including the inbound, outbound URLs, and - -[16] -L. Sherman and J. Deighton. Banner advertising: Measuring effectiveness and optimizing placement. Journal of Interactive Marketing, 15(2):60–64, 2001. - - - -V Vapnik - -The Nature of Statistical Learning Theory -1995 -Springer Verlag -New York - -f the vector xi. The margin is defined by the distance between the two parallel hyperplanes wT x +b = 1 and wT x + b = —1, i.e. 2/llwll2. The margin is related to the generalization of the classifier [17]. The SVM training problem is defined as follows: minimize (1/2)wT w + γ1T ξ subject to yi(wT xi + b) &gt; 1 — ξi, i = 1, ..., N (7) ξ &gt;0 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 00 20 40 60 80 100 0.4 - -[17] -V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995. - - - -Y Yang -J Pedersen - -An Comparison Study on Feature Selection in Text Categorization -1997 -In Proceedings of the 20th annual international ACM SIGIR conference on Research and development in informaion retrieval -Philadelphia, PA, USA - -the very first attempt in this regard. 682 Machine learning models often suffer from the curse of feature dimensionality. Feature selection plays a key role in many tasks, such as text categorization [18]. In this paper, our second contribution is to evaluate several feature selection methods and propose coupling feature selection with classification approaches to achieve the best performance: ranking -ter βt, with a finite number of parameters, Θt and λ is the learning rate. At iteration t, tree Tt(x;β) is induced to fit the negative gradient by least squares. That is statistics. Yang and Pedersen [18] gives a good comparison of these methods. Information gain is one of the most effective methods in the context of text categorization. 
In addition to information gain, we also use feature selection m -VM’s feature coefficients and stochastic gradient boosting tree’s variable importance. 5.1 Information Gain Information gain is frequently used as a measure of feature goodness in text classification [18]. It measures the number of bits of information obtained for category prediction by knowing the presence or absence of a feature. Let yi : i = 1..m be the set of categories, information gain of a feat - -[18] -Y. Yang and J. Pedersen. An Comparison Study on Feature Selection in Text Categorization. In Proceedings of the 20th annual international ACM SIGIR conference on Research and development in informaion retrieval, Philadelphia, PA, USA, 1997. - - - -S C Zhu - -Statistical modeling and conceptualization of visual patterns -2003 -IEEE Transactions on Pattern Analysis and Machine Intelligence -25 - -pular generative (such as naive Bayes method), descriptive (such as Maximum Entropy method), and discriminative (such as support vector machine and stochastic gradient boosting tree) learning methods [19] to attack the problem. 4.1 Naive Bayes Classifier A simple yet effective learning algorithm for classification l(p) 684 Query: &apos;Walmart&apos; Rank Query: &quot;Canadian gold maple leaf&apos; 0.5 - -[19] -S.C. Zhu. Statistical modeling and conceptualization of visual patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(6):619–712, 2003. - - - -
\ No newline at end of file diff --git a/bin/stderr b/bin/stderr deleted file mode 100644 index 4c8ac0d..0000000 --- a/bin/stderr +++ /dev/null @@ -1,4118 +0,0 @@ -Using -Long -Runs -as -Predictors -of -Semantic -Coherence -in -a -Partial -Document -Retrieval -System -Hyopil -Shin -Computing -Research -Laboratory, -NMSU -PO -Box -30001 -Las -Cruces, -NM, -88003 -hshin@crl.nmsu.edu -Jerrold -F. -Stach -Computer -Science -Telecommunications, -UMKC -5100 -Rockhill -Road -Kansas -City, -MO, -64110 -stach@cstp.umkc.edu -Abstract -We -propose -a -method -for -dealing -with -semantic -complexities -occurring -in -information -retrieval -systems -on -the -basis -of -linguistic -observations. -Our -method -follows -from -an -analysis -indicating -that -long -runs -of -content -words -appear -in -a -stopped -document -cluster, -and -our -observation -that -these -long -runs -predominately -originate -from -the -prepositional -phrase -and -subject -complement -positions -and -as -such, -may -be -useful -predictors -of -semantic -coherence. -From -this -linguistic -basis, -we -test -three -statistical -hypotheses -over -a -small -collection -of -documents -from -different -genre. -By -coordinating -thesaurus -semantic -categories -(SEMCATs) -of -the -long -run -words -to -the -semantic -categories -of -paragraphs, -we -conclude -that -for -paragraphs -containing -both -long -runs -and -short -runs, -the -SEMCAT -weight -of -long -runs -of -content -words -is -a -strong -predictor -of -the -semantic -coherence -of -the -paragraph. -Introduction -One -of -the -fundamental -deficiencies -of -current -information -retrieval -methods -is -that -the -words -searchers -use -to -construct -terms -often -are -not -the -same -as -those -by -which -the -searched -information -has -been -indexed. -There -are -two -components -to -this -problem, -synonymy -and -polysemy -(Deerwester -et. -al., -1990). 
-By -definition -of -polysemy, -a -document -containing -the -search -terms -or -indexed -with -the -search -terms -is -not -necessarily -relevant. -Polysemy -contributes -heavily -to -poor -precision. -Attempts -to -deal -with -the -synonymy -problem -have -relied -on -intellectual -or -automatic -term -expansion, -or -the -construction -of -a -thesaurus. -Also -the -ambiguity -of -natural -language -causes -semantic -complexities -that -result -in -poor -precision. -Since -queries -are -mostly -formulated -as -words -or -phrases -in -a -language, -and -the -expressions -of -a -language -are -ambiguous -in -many -cases, -the -system -must -have -ways -to -disambiguate -the -query. -In -order -to -resolve -semantic -complexities -in -information -retrieval -systems, -we -designed -a -method -to -incorporate -semantic -information -into -current -IR -systems. -Our -method -( -1 -) -adopts -widely -used -Semantic -Information -or -Categories, -(2) -calculates -Semantic -Weight -based -on -probability, -and -(3) -(for -the -purpose -of -verifying -the -method) -performs -partial -text -retrieval -based -upon -Semantic -Weight -or -Coherence -to -overcome -cognitive -overload -of -the -human -agent. -We -make -two -basic -assumptions: -1. -Matching -search -terms -to -semantic -categories -should -improve -retrieval -precision. -2. -Long -runs -of -content -words -have -a -linguistic -basis -for -Semantic -Weight -and -can -also -be -verified -statistically. -1 -A -Brief -Overview -of -Previous -Approaches -There -have -been -several -attempts -to -deal -with -complexity -using -semantic -information. -These -methods -are -hampered -by -the -lack -of -dictionaries -containing -proper -semantic -categories -for -classifying -text. -Semantic -methods -designed -by -Boyd -et. -al. -(1994) -and -Wendlandt -et. -al. -(1991) -demonstrate -only -simple -examples -and -are -restricted -to -small -numbers -of -words. 
-In -order -to -overcome -this -6 -deficiency, -we -propose -to -incorporate -the -structural -information -of -the -thesaurus, -semantic -categories -(SEMCATs). -However, -we -must -also -incorporate -semantic -categories -into -current -IR -systems -in -a -compatible -manner. -The -problem -we -deal -with -is -partial -text -retrieval -when -all -the -terms -of -the -traditional -vector -equations -are -not -known. -This -is -the -case -when -retrieval -is -associated -with -a -near -real -time -filter, -or -when -the -size -or -number -of -documents -in -a -corpus -is -unknown. -In -such -cases -we -can -retrieve -only -partial -text, -a -paragraph -or -page. -But -since -there -is -no -document -wide -or -corpus -wide -statistics, -it -is -difficult -to -judge -whether -or -not -the -text -fragment -is -relevant. -The -method -we -employ -in -this -paper -identifies -semantic -"hot -spots" -in -partial -text. -These -"hot -spots" -are -loci -of -semantic -coherence -in -a -paragraph -of -text. -Such -paragraphs -are -likely -to -convey -the -central -ideas -of -the -document, -We -also -deal -with -the -computational -aspects -of -partial -text -retrieval. -We -use -a -simple -stop/stem -method -to -expose -long -runs -of -context -words -that -are -evaluated -relative -to -the -search -terms. -Our -goal -is -not -to -retrieve -a -highly -relevant -sentence, -but -rather -to -retrieve -a -portion -of -text -that -is -semantically -coherent -with -respect -to -the -search -terms. -This -locale -can -be -returned -to -the -searcher -for -evaluation -and -if -it -is -relevant, -the -search -terms -can -be -refined. -This -approach -is -compatible -with -Latent -Semantic -Indexing -(LSI) -for -partial -text -retrieval -when -the -terms -of -the -vector -space -are -not -known. -LSI -is -based -on -a -vector -space -information -retrieval -method -that -has -demonstrated -improved -performance -over -the -traditional -vector -space -techniques. 
-So -when -incorporating -semantic -information, -it -is -necessary -to -adopt -existing -mathematical -methods -including -probabilistic -methods -and -statistical -methods. -2 -Theoretical -Background -2.1 -Long -Runs -Partial -Information -Retrieval -has -to -with -detection -of -main -ideas. -Main -ideas -are -topic -sentences -that -have -central -meaning -to -the -text. -Our -method -of -detecting -main -idea -paragraphs -extends -from -Jang -(1997) -who -observed -that -after -stemming -and -stopping -a -document, -long -runs -of -content -words -cluster. -Content -word -runs -are -a -sequence -of -content -words -with -a -function -word(s) -prefix -and -suffix. -These -runs -can -be -weighted -for -density -in -a -stopped -document -and -vector -processed. -We -observed -that -these -long -content -word -runs -generally -originate -from -the -prepositional -phrase -and -subject -complement -positions, -providing -a -linguistic -basis -for -a -dense -neighbourhood -of -long -runs -of -content -words -signalling -a -semantic -locus -of -the -writing. -We -suppose -that -these -neighbourhoods -may -contain -main -ideas -of -the -text. -In -order -to -verify -this, -we -designed -a -methodology -to -incorporate -semantic -features -into -information -retrieval -and -examined -long -runs -of -content -words -as -a -semantic -predictor. -We -examined -all -the -long -runs -of -the -Jang -(1997) -collection -and -discovered -most -of -them -originate -from -the -prepositional -phrase -and -subject -complement -positions. -According -to -Halliday -(1985), -a -preposition -is -explained -as -a -minor -verb. -It -functions -as -a -minor -Predicator -having -a -nominal -group -as -its -complement. -Thus -the -internal -structure -of -'across -the -lake' -is -like -that -of -'crossing -the -lake', -with -a -non-finite -verb -as -Predicator -(thus -our -choice -of -3 -words -as -a -long -run). 
-When -we -interpret -the -preposition -as -a -"minor -Predicator" -and -"minor -Process", -we -are -interpreting -the -prepositional -phrase -as -a -kind -of -minor -clause. -That -is, -prepositional -phrases -function -as -a -clause -and -their -role -is -predication. -Traditionally, -predication -is -what -a -statement -says -about -its -subject. -A -named -predication -corresponds -to -an -externally -defined -function, -namely -what -the -speaker -intends -to -say -his -or -her -subject, -i.e. -their -referent. -If -long -runs -largely -appear -in -predication -positions, -it -would -suggest -that -the -speaker -is -saying -something -important -and -the -longer -runs -of -content -words -would -signal -a -locus -of -the -speaker's -intention. -Extending -from -the -statistical -analysis -of -Jang -(1997) -and -our -observations -of -those -long -runs -in -the -collection, -we -give -a -basic -assumption -of -OUT -study: -Long -runs -of -content -words -contain -significant -semantic -information -that -a -speaker -wants -to -express -and -focus, -and -thus -are -semantic -indicators -or -loci -or -main -ideas. -7 -In -this -paper, -we -examine -the -SEMCAT -values -of -long -and -short -runs, -extracted -from -a -random -document -of -the -collection -in -Jang -(1997), -to -determine -if -the -SEMCAT -weights -of -long -runs -of -content -words -are -semantic -predictors. -2.2 -SEMCATs -We -adopted -Roget's -Thesaurus -for -our -basic -semantic -categories -(SEMCATs). -We -extracted -the -semantic -categories -from -the -online -Thesaurus -for -convenience. -We -employ -the -39 -intermediate -categories -as -basic -semantic -information, -since -the -6 -main -categories -are -too -general, -and -the -many -sub-categories -are -too -narrow -to -be -taken -into -account. -We -refer -to -these -39 -categories -as -SEMCATs. 
-Table -1: -Semantic -Categories -(SEMCATs) -Abbreviation -Full -Description -1 -AFIG -Affection -in -General -2 -ANT -Antagonism -3 -CAU -Causation -4 -CHN -Change -5 -COIV -Conditional -Intersocial -Volition -6 -CRTH -Creative -Thought -7 -DIM -Dimensions -EXIS -Existence -9 -EXOT -Extension -of -Thought -1° -FORM -Form -11 -GINV -General -Inter -social -Volition -12 -INOM -Inorganic -Matter -13 -MECO -Means -of -Communication -14 -MFRE -Materials -for -Reasoning -15 -MIG -Matter -ingeneral -16 -MOAF -Moral -Affections -17 -MOCO -Modes -of -Communication -18 -MOT -Motion -19 -NOIC -Nature -of -Ideas -Communicated -20 -NUM -Number -21 -opm -Operations -of -Intelligence -In -General -22 -ORD -Order -23 -ORGM -Organic -Matter -24 -pEAF -Personal -Affections -25 -PORE -Possessive -Relations -26 -PRCO -Precursory -Conditions -and -Operations -27 -PRVO -Prospective -Volition -28 -QUAN -Quantity -29 -REAF -Religious -Affections -ao -RELN -Relation -31 -REOR -Reasoning -Organization -32 -REPR -Reasoning -Process -33 -ROVO -Result -of -Voluntary -Action -34 -SIG -Space -in -General -35 -S -IVO -Special -Inter -social -Volition -36 -SYAF -Sympathetic -Affections -37 -TIME -Time -38 -VOAC -Voluntary -Action -39 -VOIG -Volition -in -General -2.3 -Indexing -Space -and -Stop -Lists -Many -of -the -most -frequently -occurring -words -in -English, -such -as -"the," -"of," -"and," -"to," -etc. -are -non-discriminators -with -respect -to -information -filtering. -Since -many -of -these -function -words -make -up -a -large -fraction -of -the -text -of -Most -documents, -their -early -elimination -in -the -indexing -process -speeds -processing, -saves -significant -amounts -of -index -space -and -does -not -compromise -the -filtering -process. -In -the -Brown -Corpus, -the -frequency -of -stop -words -is -551,057 -out -of -1,013,644 -total -words. -Function -words -therefore -account -for -about -54.5% -of -the -tokens -in -a -document. 
-The -Brown -Corpus -is -useful -in -text -retrieval -because -it -is -small -and -efficiently -exposes -content -word -runs. -Furthermore, -minimizing -the -document -token -size -is -very -important -in -NLPbased -methods, -because -NLP-based -methods -usually -need -much -larger -indexing -spaces -than -statistical-based -methods -due -to -processes -for -tagging -and -parsing. -3 -Experimental -Basis -In -order -to -verify -that -long -runs -contribute -to -resolve -semantic -complexities -and -can -be -used -as -predictors -of -semantic -intent, -we -employed -a -probabilistic, -vector -processing -methodology. -3.1 -Revised -Probability -and -Vector -Processing -In -order -to -understand -the -calculation -of -SEMCATs, -it -is -helpful -to -look -at -the -structure -8 -of -a -preprocessed -document. -One -document -"Barbie" -in -the -Jang -(1997) -collection -has -a -total -of -1,468 -words -comprised -of -755 -content -words -and -713 -function -words. -The -document -has -17 -paragraphs. -Filtering -out -function -words -using -the -Brown -Corpus -exposed -the -runs -of -content -words -as -shown -in -Figure -1. -Figure -1: -Preprocessed -Text -Document -BARBIE -* -* -* -* -FAVORITE -COMPANION -DETRACTORS -LOVE -* -* -* -PLASTIC -PERFECTION -* -FASHION -DOLL -* -* -IMPOSSIBLE -FIGURE -* -LONG -* -* -* -POPULAR -GIRL -* -MA -ITEL -* -WORLD -* -TOYMAKER -* -PRODUCTS -RANGE -* -FISHER -PRICE -INFANT -* -SALES -* -* -* -TALL -MANNEQUIN -* -BARBIE -* -* -AGE -* -* -* -BEST -SELLING -GIRLS -BRAND -* -* -POISED -* -STRUT -* -* -CHANGE -* -* -MALE -DOMINATED -WORLD -* -MULTIMEDIA -SOFTWARE -* -VIDEO -GAMES -In -Figure -1, -asterisks -occupy -positions -where -function -words -were -filtered -out. -The -bold -type -indicates -the -location -of -the -longest -runs -of -content -words. 
-The -run -length -distribution -of -Figure -1 -is -shown -below: -Table -2: -Distribution -of -Content -Run -Lengths -in -a -sample -Document -Run -Length -Frequency -1 -11 -2 -8 -3 -2 -4 -2 -The -traditional -vector -processing -model -requires -the -following -set -of -terms: -• -(df) -the -number -of -documents -in -the -collection -that -each -word -occurs -in -• -(idf) -the -inverse -document -frequency -of -each -word -determined -by -log10(N/df) -where -N -is -the -total -number -of -documents. -If -a -word -appears -in -a -query -but -not -in -a -document, -its -idf -is -undefined. -• -The -category -probability -of -each -query -word. -Wendlandt -(1991) -points -out -that -it -is -useful -to -retrieve -a -set -of -documents -based -upon -key -words -only, -and -then -considers -only -those -documents -for -semantic -category -and -attribute -analysis. -Wendlandt -(1991) -appends -the -s -category -weights -to -the -t -term -weights -of -each -document -vector -Di -and -the -Query -vector -Q. -Since -our -basic -query -unit -is -a -paragraph, -document -frequency -(df) -and -inverse -document -frequency -(idf) -have -to -be -redefined. -As -we -pointed -out -in -Section -1, -all -terms -are -not -known -in -partial -text -retrieval. -Further, -our -approach -is -based -on -semantic -weight -rather -than -word -frequency. -Therefore -any -frequency -based -measures -defined -by -Boyd -et -al. -(1994) -and -Wendlandt -(1991) -need -to -be -built -from -the -probabilities -of -individual -semantic -categories. -Those -modifications -are -described -below. -As -a -simplifying -assumption, -we -assume -SEMCATs -have -a -uniform -probability -distribution -with -regard -to -a -word. -3.2 -Calculating -SEMCATs -Our -first -task -in -computing -SEMCAT -values -was -to -create -a -SEMCAT -dictionary -for -our -method. -We -extracted -SEMCATs -for -every -word -from -the -World -Wide -Web -version -of -Roget's -thesaurus.
-SEMCATs -give -probabilities -of -a -word -corresponding -to -a -semantic -category. -The -content -word -run -'favorite -companion -detractors -love' -is -of -length -4. -Each -word -of -the -run -maps -to -at -least -one -SEMCAT. -The -word -`favorite' -maps -to -categories -`PEAF -and -SYAF'. -'companion' -maps -to -categories -'ANT, -MECO, -NUM, -ORD, -ORGM, -PEAF, -PRVO, -QUAN, -and -SYAF'. -'detractor' -maps -to -`MOAF'. -'love' -maps -to -`AFIG, -ANT, -MECO, -MOAF, -MOCO, -ORGM, -PEAF, -PORE, -PRVO, -SYAF, -and -VOIG'. -We -treat -the -long -runs -as -a -semantic -core -from -which -to -calculate -SEMCAT -values. -SEMCAT -weights -are -calculated -based -on -the -following -equations. -Eq.1 -$p_{ik}$ -(Probability) -- -The -likelihood -of -SEMCAT -$S_i$ -occurring -due -to -the -kth -trigger. -For -example, -assuming -a -uniform -probability -distribution, -the -category -PEAF -triggered -by -the -word -favorite -above, -has -the -following -probability: -$p_{\mathrm{PEAF},\,\mathrm{favorite}} = 1/2 = 0.5$ -Eq.2 -$Sw_i$ -(SEMCAT -Weights -in -Long -runs) -is -the -sum -of -each -SEMCAT -weight -of -long -runs -based -on -their -probabilities. -In -the -above -example, -the -long -run -9 -'favorite -companion -detractors -love,' -the -SEMCAT -`MOAF' -has -$Sw_{\mathrm{MOAF}}$ -(detractor(1) -+ -love(.09)) -= -1.09. -We -can -write: -$Sw_i = \sum_{k} p_{ik}$ -Eq.3 -$edw_i$ -(Expected -data -weights -in -a -paragraph) -- -Given -a -set -of -N -content -words -(data) -in -a -paragraph, -the -expected -weight -of -the -SEMCATs -of -long -runs -in -a -paragraph -is: -$edw_i = \sum_{k=1}^{N} p_{ik}$ -Eq.4 -$idw_i$ -(Inverse -data -weights -in -a -paragraph) -- -The -inverse -data -weight -of -SEMCATs -of -long -runs -for -a -set -of -N -content -words -in -a -paragraph -is -$idw_i = \log_{10}(N / edw_i)$ -Eq.5 -Weight(W) -- -The -weight -of -SEMCAT -$S_i$ -in -a -paragraph -is -$W_i = Sw_i \times idw_i$ -Eq.6 -Relevance -Weights -(Semantic -Coherence) -Our -method -performs -the -following -steps: -1.
-calculate -the -SEMCAT -weight -of -each -long -content -word -run -in -every -paragraph -(Sw) -2. -calculate -the -expected -data -weight -of -each -paragraph -(edw) -3. -calculate -the -inverse -expected -data -weight -of -each -paragraph -(idw) -4. -calculate -the -actual -weight -of -each -paragraph -(Swxidw) -5. -calculate -coherence -weights -(total -relevance) -by -summing -the -weights -of -(Swxidw). -In -every -paragraph, -extraction -of -SEMCATs -from -long -runs -is -done -first. -The -next -step -is -finding -the -same -SEMCATs -of -long -runs -through -every -word -in -a -paragraph -(expected -data -weight), -then -calculate -idw, -and -finally -Swxidw. -The -final, -total -relevance -weights -are -an -accumulation -of -all -weights -of -SEMCATs -of -content -words -in -a -paragraph. -Total -relevance -tells -how -many -SEMCATs -of -the -Query's -long -runs -appear -in -a -paragraph. -Higher -values -imply -that -the -paragraph -is -relevant -to -the -long -runs -of -the -Query. 
-The -following -is -a -program -output -for -calculating -SEMCAT -weights -for -an -arbitrary -long -run: -"SEVEN -INTERACTIVE -PRODUCTS -LED" -SEMCAT: -EXOT -Sw -: -1.00 -edw -: -1.99 -idw -: -1.44 -Swxidw -: -1.44 -SEMCAT: -GINV -Sw -: -0.33 -edw -: -1.62 -idw -: -1.53 -Swxidw -: -0.51 -SEMCAT: -MOT -Sw -: -0.20 -edw -: -0.71 -idw -: -1.89 -Swxidw -: -0.38 -SEMCAT: -NUM -Sw -: -0.20 -edw -: -1.76 -idw -: -1.49 -Swxidw -: -0.30 -SEMCAT: -ORGM -Sw -: -0.20 -edw -: -1.67 -idw -: -1.52 -Swxidw -: -0.30 -SEMCAT: -PEAF -Sw -: -0.53 -edw -: -1.50 -idw -: -1.56 -Swxidw -: -0.83 -SEMCAT: -REAF -Sw -: -0.20 -edw -: -0.20 -idw -: -2.44 -Swxidw -: -0.49 -SEMCAT: -SYAF -Sw -: -0.33 -edw -: -1.19 -idw -: -1.66 -Swxidw -: -0.55 -Total -(Swxidw) -: -4.79 -4 -Experimental -Results -The -goal -of -employing -probability -and -vector -processing -is -to -prove -the -linguistic -basis -that -long -runs -of -content -words -can -be -used -as -predictors -of -semantic -intent. -But -we -also -want -to -exploit -the -computational -advantage -of -removing -the -function -words -from -the -document, -which -reduces -the -number -of -tokens -processed -by -about -50% -and -thus -reduces -vector -space -and -probability -computations. -If -it -is -true -that -long -runs -of -content -words -are -predictors -of -semantic -coherence, -we -can -further -reduce -the -complexity -of -vector -computations: -(1) -by -eliminating -those -paragraphs -without -long -runs -from -consideration, -(2) -within -remaining -paragraphs -with -long -runs, -computing -and -summing -the -semantic -coherence -of -the -longest -runs -only, -(3) -ranking -the -eligible -paragraphs -for -retrieval -based -upon -their -semantic -weights -relative -to -the -query. -Jang -(1997) -established -that -the -distribution -of -long -runs -of -content -words -and -short -runs -of -content -words -in -a -collection -of -paragraphs -are -drawn -from -different -populations.
-This -implies -10 -that -either -long -runs -or -short -runs -are -predictors, -but -since -all -paragraphs -contain -short -runs, -i.e. -a -single -content -word -separated -by -function -words, -only -long -runs -can -be -useful -predictors. -Furthermore, -only -long -runs -as -we -define -them -can -be -used -as -predictors -because -short -runs -are -insufficient -to -construct -the -language -constructs -for -prepositional -phrase -and -subject -complement -positions. -If -short -runs -were -discriminators, -the -linguistic -assumption -of -this -research -would -be -violated. -The -statistical -analysis -of -Jang -(1997) -does -not -indicate -this -to -be -the -case. -To -proceed -in -establishing -the -viability -of -our -approach, -we -proposed -the -following -experimental -hypotheses: -(H1) -The -SEMCAT -weights -for -long -runs -of -content -words -are -statistically -greater -than -weights -for -short -runs -of -content -words. -Since -each -content -word -can -map -to -multiple -SEMCATs, -we -cannot -assume -that -the -semantic -weight -of -a -long -run -is -a -function -of -its -length. -The -semantic -coherence -of -long -runs -should -be -a -more -granular -discriminator. -(H2) -For -paragraphs -containing -long -runs -and -short -runs, -the -distribution -of -long -run -SEMCAT -weights -is -statistically -different -from -the -distribution -of -short -run -SEMCAT -weights. -(H3) -There -is -a -positive -correlation -between -the -sum -of -long -run -SEMCAT -weights -and -the -semantic -coherence -of -a -paragraph, -the -total -paragraph -SEMCAT -weight. -A -detailed -description -of -these -experiments -and -their -outcome -are -described -in -Shin -(1997, -1999). -The -results -of -the -experiments -and -the -implications -of -those -results -relative -to -the -method -we -propose -are -discussed -below.
-Table -3 -gives -the -SEMCAT -weights -for -seventeen -paragraphs -randomly -chosen -from -one -document -in -the -collection -of -Jang -(1997). -Table -3: -SEMCAT -Weights -of -17 -Paragraphs -Chosen -Randomly -From -a -Collection -Paragraph -Short -Runs -Long -Runs -Weight -Weight -1 -29.84 -18.60 -2 -31.29 -12.81 -3 -23.29 -4.25 -4 -23.94 -11.63 -5 -34.63 -35.00 -6 -22.85 -03.32 -7 -21.74 -00.00 -8 -35.84 -15.94 -9 -30.15 -00.00 -10 -13.40 -00.00 -11 -23.01 -07.82 -12 -31.69 -04.79 -13 -36.54 -00.00 -14 -17.91 -10.55 -15 -19.70 -05.83 -16 -17.11 -00.00 -17 -31.86 -00.00 -The -data -was -evaluated -using -a -standard -two -way -F -test -and -analysis -of -variance -table -with -α -= -.05. -The -analysis -of -variance -table -for -the -paragraphs -in -Table -3 -is -shown -in -Table -4. -Table -4: -Analysis -of -Variance -for -Table -3 -Data -Variation -Degrees -of -Mean -Square -F -Freedom -Between -1 -2904.51 -68.56 -Treatments -V, -= -2904.51 -Between -Blocks -16 -93.92 -2.21 -Vr -= -1502.83 -Residual -or -16 -42.36 -Random -V,= -677.77 -Total -33 -V -= -5085.11 -At -the -.05 -significance -level, -F.05 -= -4.49 -for -1,16 -degrees -of -freedom. -Since -68.56 -> -4.49 -we -reject -the -assertion -that -column -means -(run -weights) -are -equal -in -Table -3. -Long -run -and -short -run -weights -come -from -different -populations. -We -accept -H1. -For -the -between -paragraph -treatment, -the -row -means -(paragraph -weights) -have -an -F -value -of -2.21. -At -the -.05 -significance -level, -F.05 -= -2.28 -for -16,16 -degrees -of -freedom. -Since -2.21 -< -2.28 -we -cannot -reject -the -assertion -that -there -is -no -significant -difference -in -SEMCAT -weights -between -paragraphs. -That -is, -paragraph -weights -do -not -appear -to -be -taken -from -different -populations, -as -do -the -long -run -and -short -run -weight -distributions.
-Thus, -the -semantic -weight -11 -of -the -content -words -in -a -paragraph -cannot -be -used -to -predict -the -semantic -weight -of -the -paragraph. -We -therefore -proceed -to -examine -H2. -Notice -that -two -paragraphs -in -Table -3 -are -without -long -runs. -We -need -to -repeat -the -analysis -of -variance -for -only -those -paragraphs -with -long -runs -to -see -if -long -runs -are -discriminators. -Table -5 -summarizes -those -paragraphs. -Table -5: -SEMCAT -weights -of -11 -paragraphs -containing -long -runs -and -short -runs -Paragraph -Short -Runs -Long -Runs -Weight -Weight -1 -29.84 -18.60 -2 -31.29 -12.81 -3 -23.29 -4.25 -4 -23.94 -11.63 -5 -34.63 -35.00 -6 -22.85 -03.32 -8 -35.84 -15.94 -11 -23.01 -07.82 -12 -31.69 -04.79 -14 -17.91 -10.55 -15 -19.70 -05.83 -This -data -was -evaluated -using -a -standard -two -way -F -test -and -analysis -of -variance -with -α -= -.05. -The -analysis -of -variance -table -for -the -paragraphs -in -Table -5 -follows. -Table -6: -Analysis -of -Variance -for -Table -5 -Data -Variation -Degrees -of -Freedom -Mean -Square -F -Between -Treatments -1 -1430.98 -291.44 -V= -1430.98 -Between -Blocks -10 -94.40 -19.22 -V= -944.05 -Residual -or -10 -4.91 -Random -V= -49.19 -Total -21 -V -= -2424.26 -At -the -.05 -significance -level, -F.05 -= -4.10 -for -2,10 -degrees -of -freedom. -4.10 -< -291.44. -At -the -.05 -significance -level, -F.05 -= -2.98 -for -10,10 -degrees -of -freedom. -2.98 -< -19.22. -For -paragraphs -in -a -collection -containing -both -long -and -short -runs: -the -SEMCAT -weights -of -the -long -runs -and -short -runs -are -drawn -from -different -distributions. -We -accept -H2. -For -paragraphs -containing -long -runs -and -short -runs, -the -distribution -of -long -run -SEMCAT -weights -is -different -from -the -distribution -of -short -run -SEMCAT -weights. -We -know -from -the -linguistic -basis -for -long -runs -that -short -runs -cannot -be -used -as -predictors.
-We -therefore -proceed -to -examine -the -Pearson -correlation -between -the -long -run -SEMCAT -weights -and -paragraph -SEMCAT -weights -for -those -paragraphs -with -both -long -and -short -content -word -runs. -Table -7: -Correlation -of -Long -Run -SEMCAT -Weights -to -Paragraph -SEMCAT -Weight -Paragraph -Long -Runs -Semantic -Weight -Paragraph -Semantic -Weight -1 -18.60 -48.44 -2 -12.81 -44.10 -3 -4.25 -27.54 -4 -11.63 -35.57 -5 -35.00 -69.63 -6 -03.32 -26.17 -8 -15.94 -51.78 -11 -07.82 -30.83 -12 -04.79 -31.69 -14 -10.55 -28.46 -15 -05.83 -25.53 -The -weights -in -Table -7 -have -a -positive -Pearson -Product -Correlation -coefficient -of -.952. -We -therefore -accept -H3. -There -is -a -positive -correlation -between -the -sum -of -long -run -SEMCAT -weights -and -the -semantic -coherence -of -a -paragraph, -the -total -paragraph -SEMCAT -weight. -5. -Conclusion -This -research -tested -three -statistical -hypotheses -extending -from -two -observations: -(1) -Jang -(1997) -observed -the -clustering -of -long -runs -of -content -words -and -established -the -distribution -of -long -run -lengths -and -short -run -lengths -are -drawn -from -different -populations, -(2) -our -observation -that -these -long -runs -of -content -words -originate -from -the -prepositional -phrase -and -subject -complement -positions. -According -to -Halliday -(1985) -those -grammar -structures -function -as -12 -minor -predication -and -as -such -are -loci -of -semantic -intent -or -coherence. -In -order -to -facilitate -the -use -of -long -runs -as -predictors, -we -modified -the -traditional -measures -of -Boyd -et -al. -(1994), -Wendlandt -(1991) -to -accommodate -semantic -categories -and -partial -text -retrieval. -The -revised -metrics -and -the -computational -method -we -propose -were -used -in -the -statistical -experiments -presented -above. -The -main -findings -of -this -work -are -1.
-the -distribution -semantic -coherence -(SEMCAT -weights) -of -long -runs -is -not -statistically -greater -than -that -of -short -runs, -2. -for -paragraphs -containing -both -long -runs -and -short -runs, -the -SEMCAT -weight -distributions -are -drawn -from -different -populations -3. -there -is -a -positive -correlation -between -the -sum -of -long -run -SEMCAT -weights -and -the -total -SEMCAT -weight -of -the -paragraph -(its -semantic -coherence). -Significant -additional -work -is -required -to -validate -these -preliminary -results. -The -collection -employed -in -Jang -(1997) -is -not -a -standard -Corpus -so -we -have -no -way -to -test -precision -and -relevance -of -the -proposed -method. -The -results -of -the -proposed -method -are -subject -to -the -accuracy -of -the -stop -lists -and -filtering -function. -Nonetheless, -we -feel -the -approach -proposed -has -potential -to -improve -performance -through -reduced -token -processing -and -increased -relevance -through -consideration -of -semantic -coherence -of -long -runs. -Significantly, -our -approach -does -not -require -knowledge -of -the -collection. -References -#2029 4 -Here1 -#3894 4 -Here1 - --> Boyd (1994) - --> Boyd (1994), -#974 2 -Here1 -#3857 2 -Here1 - --> Halliday (1985), - --> Halliday (1985) -#822 2 -Here1 -#955 2 -Here1 -#1137 2 -Here1 -#1667 2 -Here1 -#2798 2 -Here1 - --> Jang (1997) - --> Jang (1997) - --> Jang (1997) - --> Jang (1997) - --> Jang (1997) -#3058 2 -Here1 - --> Shin (1997, -#3059 2 -Here1 - --> (1997, 1999). diff --git a/bin/test.body b/bin/test.body deleted file mode 100644 index beae8b7..0000000 --- a/bin/test.body +++ /dev/null @@ -1,887 +0,0 @@ -From Freedom to Liberty: The Construction of a Political Value -Williams, Bernard Arthur Owen. -Philosophy & Public Affairs, Volume 30, Number 1, Winter 2001, pp. 
3-26 (Article) -Published by Princeton University Press -DOI: 10.1353/pap.2001.0015 -For additional information about this article -http://muse.jhu.edu/journals/pap/summary/v030/30.1williams.html -Access Provided by Cambridge University Library at 06/03/10 11:08AM GMT -From Freedom to Liberty: -BERNARD WILLIAMS -The Construction of a -Political Value -I. INTRODUCTION -My subject is freedom and in particular freedom as a political value. Many -discussions of this topic consist of trying to define the idea of freedom, -or various ideas of freedom. I do not think that we should be interested -in definitions. I leave aside the very general philosophical point that if -we mean, seriously, definitions, there are no very interesting definitions -of anything. There is a more particular reason. In the case of ethical and -political ideas, what puzzles and concerns us is the understanding of -those ideas—in the present case, freedom—as a value for us in our world. -I do not mean that we are interested in it only as it figures in precisely -our set of values—meaning by that, those of a liberal democratic society. -Manifestly it is equally part of our world that such ideas are also used by -those who do not share our values or only partly share them—those with -whom we are in confrontation, discussion, negotiation, or competition, -with whom in general we share the world. Indeed, we will disagree among -ourselves about freedom within our own society. We experience conflicts -between freedom and other values, and—a point I shall emphasize—we -understand some desirable measures as involving a cost in freedom.
-Whatever our various relations may be with others in our world who -do or do not share our conception of freedom, we will not understand -our own specific relations to that value unless we understand what we -want that value to do for us—what we, now, need it to be in shaping our -An earlier version of this paper was given as the Dewey Lecture at the University of Chi- -cago Law School, April 2001. -© 2001 by Princeton University Press. Philosophy & Public Affairs 30, no. 1 -Philosophy & Public Affairs -4 -own institutions and practices, in disagreeing with those who want to -shape them differently, and in understanding and trying to co-exist with -those who live under other institutions. -In all their occurrences, these various conceptions or understandings -of freedom, including the ones we immediately need for ourselves, in- -volve a complex historical deposit, and we will not understand them -unless we grasp something of that deposit, of what the idea of freedom, -in these various connections, has become. This contingent historical -deposit, which makes freedom what it now is, cannot be contained in or -anticipated by anything that could be called a definition. It is the same -here as it is with other values: philosophy, or as we might say a priori -anthropology, can construct a core or skeleton or basic structure for the -value, but both what it has variously become, and what we now need it -to be, must be a function of actual history. In the light of this, we can say -that our aim is not to define but to construct a conception of freedom. -I shall not attempt a general account of what might count as construct- -ing one or another conception of freedom. One might say that the no- -tion of construction applies at different levels.
We need to construct a -value of freedom specifically for us; and we need a more generic con- -struction or plan of freedom which helps us to place other conceptions -of it in a philosophical and historical space—which shows us, one might -say, how other specific conceptions might be constructed in their own -right. Some of the questions raised by these requirements would simply -be a matter of terminology, of how we might use the term ‘construction.’ -But there is a more significant consideration which links these two lev- -els. The conception of freedom we need for ourselves is both historically -self-conscious and suitable to a modern society—and those two features -are of course related to one another. Because of this, our own specific -and active conception of freedom, the one we need for our practical pur- -poses, will contain implicitly the materials for a reflective understand- -ing of the more general possibilities of construction. -However, it is just as important that the disputes that have circled -around the various definitions and concepts of liberty do not just repre- -sent a set of verbal misunderstandings. They have been disagreements -about something. There is even a sense in which they have been dis- -agreements about some one thing. There must be a core, or a primitive -conception, perhaps some universal or widely spread human experience, -to which these various conceptions relate. This does not provide, as it -From Freedom to Liberty: -5 -The Construction of a -Political Value -were, the ultimate definition. Indeed, this core or primitive item, I am -going to suggest, is certainly not a political value, and perhaps not a value -at all. But it can, and must, explain how these various accounts of the -value of freedom are elaborations of the same thing, that these various -interpretations are not just talking past each other.
-There is another consideration which the familiar philosophical dis- -putes and attempts at definition indeed take for granted, but they do not -give the right weight to it. In the sense that concerns these discussions, -freedom is a political value. (They are not addressing, for instance, meta- -physical questions about the freedom of the will.) I am going to suggest -that this point itself, when it is properly understood, has a very signifi- -cant effect on the kind of construction we should be trying to achieve. In -particular, we must take seriously the point that because it is a political -value, the most important disagreements that surround it are political -disagreements. What kinds or registers of politics are involved, what the -relevant understanding of politics will be, will depend on which disagree- -ments are at issue—those within our own society, for instance, or those -with other societies. But our overall construction of freedom as a politi- -cal value must allow the fact that it is a political value to be central and -intelligible. -I am certainly not going to offer a definition or any general character- -ization of the political. That would once again be impossible. But it may -be helpful to mention now four things I believe to be true about the po- -litical, which will shape the discussion and affect my overall construc- -tion of freedom as a political value. -(a) First, a point about philosophy: political philosophy is not just ap- -plied moral philosophy, which is what in our culture it is often taken to -be.1 Nor is it just a branch of legal philosophy, a point that will concern us -later. In particular, political philosophy must use distinctively political -concepts, such as power, and its normative relative, legitimation. -Philosophy & Public Affairs -6 -(b) The idea of the political is to an important degree focused in the -idea of political disagreement; and political disagreement is significantly -different from moral disagreement.
Moral disagreement is characterized -by a class of considerations, by the kinds of reasons that are brought to -bear on a decision. Political disagreement is identified by a field of appli- -cation—eventually, about what should be done under political author- -ity, in particular through the deployment of state power. The reasons that -go into political decisions and arguments that bear on them may be of -very various kinds. Because of this, political disagreement is not merely -moral disagreement, and it need not necessarily involve it, though it may -do so; equally, it need not necessarily be a disagreement simply of inter- -ests, though of course it may be. -(c) Possible political disagreements include disagreements about the -interpretation of political values, such as freedom, equality, or justice. -These disagreements may involve many different kinds of understand- -ing and political traditions; they can tap into various areas of what I called -the historical deposit. It follows that the relation of these values to each -other cannot be established on the model of interpreting a constitution, -where questions typically take the form of determining what counts as, -say, limiting the freedom of speech. Of course, there is such an activity, -and it plays an important part in some cultures, such as that of the United -States. But even in those cases, it would be a mistake to equate political -thought about questions of principle with thought about actual or ideal -constitutional interpretation.2 We and our political opponents—even our -opponents in one polity, let alone those in others—are not just trying to -read one text. This will be an important point in what follows.
-(d) The last of these preliminary signals is provided by that word “op- -ponents.” Carl Schmitt famously said that the fundamental political re- -lation was that of friend and enemy.3 This is an ambiguous remark, and -it can take on a rather sinister tone granted the history of Schmitt’s own -relations to the Weimar Republic and eventually to the Third Reich. But -it is basically true in at least this sense, that political difference is of the -From Freedom to Liberty: -7 -The Construction of a -Political Value -essence of politics, and political difference is a relation of political oppo- -sition, rather than, in itself, a relation of intellectual or interpretative dis- -agreement. Many things can be covered by the idea of “opposition” it- -self. But they all bring with them the question of how we understand our -opponents, how far our opposition is a matter of interests, how far a -matter of principle, what sentiments are engaged, why we and they feel -so strongly about it if we do, and in what ways we each differently tap -into the historical deposit. We may for various reasons think that our -opponents are, among other things, in intellectual error, but the rela- -tions of political opposition cannot simply be understood in terms of -intellectual error. Our construction of freedom as a political value must -make sense of the fact that disagreements involving that value are typi- -cally matters of political opposition, and that this carries substantial -implications about the ways in which we should regard the disagreement, -and regard our opponents themselves. -II. PRIMITIVE FREEDOM -Some of the arguments I shall consider are, inevitably, very familiar.
My -excuse for putting on parade some of the usual suspects from Political -Philosophy 101 is rather like that which Descartes offered when he ex- -cused himself for “warming up the stale cabbage” of ancient skeptical -arguments.4 He admitted that the materials were very familiar, but he -thought that it made all the difference what you wanted to do with them. -They had to serve a particular method, and he wanted to illustrate that -method. More modestly, my aim is the same: the usual suspects have to -be put to work, but on a rather different task. -Mill, in Chapter 5 of On Liberty, says, informally enough: “liberty con- -sists in doing what one desires.” He cannot quite mean this: he must at -any rate mean the capacity to do what one desires (you are not unfree if -you simply choose not to do something you desire.) Amended in this -way, Mill agrees with Locke: “Liberty, ‘tis plain, consists in a power to do -or not to do; to do or forbear doing as we will. This cannot be denied.”5 -This is an idea of liberty as ability or capacity. It has an obvious disad- -vantage: we already have a concept of ability or capacity, and on this -Philosophy & Public Affairs -8 -showing ‘liberty’ or ‘freedom’ turn out boringly just to be other names -for it. More importantly, it misses the point of why we want these terms -in the first place. That point is registered for the first time when we add to -this kind of account a further condition, which concerns the kind of ob- -stacle that is stopping us from doing something we want to do. We say, -more narrowly, that we are unfree if our inability is the product, specifi- -cally, of coercion, where that is taken, at least in the central cases, to -mean—using the term ‘coercion’ in a broad sense—the intentional ob- -structive activities of other people.
This is incorporated in Isaiah Berlin’s -famous account of “negative” liberty, and of course, as he noted, it goes -back a long way.6 Berlin quotes, for one, Helvétius: “The free man is the -man who is not in irons, nor imprisoned in a gaol, nor terrorized like a -slave by the fear of punishment … it is not lack of freedom, not to fly like -an eagle or swim like a whale.” Though I shall be concerned with what -Berlin called “negative freedom,” I shall not use that term nor discuss the -distinction between “negative” and “positive” freedom itself. (It is mis- -leading in several respects, particularly if it is identified, as it is some- -times by Berlin, with a distinction between “freedom from” and “free- -dom to”.)7 The simple idea of being unobstructed in doing what you want -by some form of humanly imposed coercion, I shall call “primitive free- -dom.” -The range of obstacles, those identified with “coercion,” can itself be -interpreted more or less broadly. Some candidates, ordered roughly from -the obvious and agreed to the more disputable, are: -(A) Prevention by force (Helvétius’s irons and gaol); -(B) Threats of force, penalties, social rejection, and so forth -(Helvétius’s fear of punishment);8 -(C) Competition in (something like) a zero-sum game, where one -competitor sets out to stop another reaching his goal; -From Freedom to Liberty: -9 -The Construction of a -Political Value -(D) By-products of another enterprise, not aimed at the agent; -(E) By-products of an arrangement which structurally disadvantages -(those in the position of) the agent. -Some of these variations will concern us later. There is an obvious di- -vision in the list, between cases in which an agent’s activities are deliber- -ately directed against another agent’s capacity to do something, and those -in which they merely bring about that the agent loses that capacity.
There -is a further extension beyond this, where what is in question is someone’s -omission or failure to remove an obstacle to the other agent’s capacity. -However, this requires more background, in particular the political frame- -work at which we shall eventually arrive, to make it reasonable to say -that the person in question has “failed” or “omitted” to do something -about this obstacle—that is to say, that this person should do something -about it. The more it can be said that there is a person or agency in this -position, the wider the range of complaints in freedom may be. How- -ever, we should not conclude from this that we should drop the refer- -ence to coercive or limiting action altogether and revert to the concep- -tion of freedom as simply power or capacity.9 We shall come back shortly -to the basic question of why this restriction to obstacles that are intended -by other agents, or created by them, or at the very least not removed by -them, should be so significant. -III. FREEDOM AS A RATIO CONCEPT -First, however, there is a different point to be made about primitive free- -dom. Primitive freedom is a ratio concept: it is a matter of the ratio be- -tween what people desire to do and what they are prevented by others -from doing. This implies that there are two ways to increase people’s free- -dom. I may remove the forces or obstacles that prevent them from satis- -fying their desires. But equally I may bring it about that they do not have -desires that cannot be satisfied. This leads to a paradox. Suppose, im- -plausibly and for the sake of argument, that there were a body of entirely -contented slaves. They are not physically abused, and they do not want -to do any of the things their slavery prevents them from doing. Under -this concept of freedom, they are free.
If reformers appear and tell them -Philosophy & Public Affairs -10 -what they are missing and make them for the first time discontented, it -might even be said that it is the reformers who have taken away their -freedom. A concept of freedom that leads to this cannot be adequate. -One reaction to this is to say that freedom should be measured not in -terms of what people actually desire, but in terms of what they should -reasonably, properly, or appropriately desire. This idea can take various -forms. It can also be applied not just to a deficit of appropriate desire, as -in the case of the slaves, but to an excess of inappropriate desire, as in- -deed it has been by moralists in the Stoic tradition. The construction of -freedom as a political value should certainly leave room for arguments -of this form: besides the familiar answer to a complaint in freedom, that -the constraint on desire is necessary (for instance in the interests of oth- -ers), there is a possible answer in some cases that the desire is unreason- -able and the agent would be better off without it. In particular, he would -be more free. But as a general principle of argument, this runs the risk of -heading in the direction of what Berlin called “positive freedom”: at the -limit, the argument will be heard that coercive force can be justified to -prevent the formation of inappropriate desires or to encourage the for- -mation of appropriate ones, so that people, as Rousseau put it, can be -forced to be free.
That notorious phrase has rightly been seen as para- -doxical.10 What is true, though, is that this kind of idea is not simply an -arbitrary appropriation of the word “freedom”—it is rooted in certain -features of the concept, although it develops them in an irresponsible -way.11 -There is another way of dealing with the ratio paradox, which appeals -not to a normatively approved list of desires, but rather to some special -explanations of why people do not have certain desires they might be -expected to have. So in the slave case, the absence of a desire for free- -dom may be diagnosed as itself a product of coercion: it is precisely be- -cause of the way in which they are treated, prevented from hearing of -other options and so on, that the state of their desires is as it is. The idea -of this is the same as that employed in the Critical Theory test for beliefs -which supposedly legitimate some prima facie oppressive institution: -whether the belief is the product of the coercion which it supposedly -helps to legitimate. The principle of these tests seems entirely sound and -to flow naturally from the structure of the idea of coercion; the problem -with them is of course going to lie in the prospects of making good an -interpretation in these terms in any given case. We shall come back to -the happy slaves later, and try to fix rather more definitely where the Criti- -cal Theory test fits into the construction of liberty. -IV. WHY COERCION? -Why should we pick on, specifically, primitive freedom, with its concen- -tration on human sources of constraint, as the starting point? The an- -swer is that primitive freedom is, as we might put it, a “proto-political” -concept. This does not merely mean that if we are interested in freedom -as a political value (as we are), this is the place to start.
It means some- -thing stronger: that this is the place to start because it involves a quite -basic human phenomenon, and that phenomenon already points in the -direction of politics. -In a frequently quoted remark, Heracleitus said “They would not have -known the name of justice, if it had not been for these things,” and it is -virtually certain that “these things” are disputes, quarrels, and conflict.12 -Justice, hence an authoritative source of justice, hence an empowered -enforcer of justice, is needed to impose solutions on what would other- -wise be unbounded conflicts of interest. Similarly, the restriction of our -activities by the intentional activities of others, as contrasted with the -ubiquitous limitations we face in nature, can give rise to a quite specific -reaction, resentment; and if resentment is not to express itself in more -conflict, non-cooperation, and dissolution of social relations, an authori- -tative determination is needed of whose activities should have priority -(needless to say, that determination itself may well use concepts of jus- -tice.) In an appropriate context, resentment can be directed to inaction, -to a refusal to remove some obstacle if it can be claimed that it is the -other party’s business to remove it. But it cannot extend to what are rec- -ognized as blankly the obstacles of nature. Rousseau’s distinction be- -tween being confined in one’s house by a snowstorm and being locked -in it by someone else remains in place.13 -But now there is a further development peculiarly connected with free- -dom. As soon as the authoritative source is indeed empowered and de- -ploys coercion to enforce its rulings, that coercion itself can give rise to -resentment. Questions arise of how that power is being used, questions -that demand legitimating accounts.
Those questions are likely to become -more pressing, the closer the situation comes to that in which the au- -thority uniquely commands the means of some kinds of coercion (such -as (A) above, and to some extent (B))—that is to say, the closer it comes -to the ideal type of there being a state. To various degrees in different -societies, these questions will be the subject of discussion. The political, -in some of its many forms, now exists. -V. TOWARDS LIBERTY -We do not yet have freedom as a political value: a political value which -from now on, making a distinction I have not used up to now, I shall call -liberty. -Primitive freedom is not itself that political value.14 We can see this by -considering an idea which arises as soon as we have the conditions of -the political, that is to say, an authority, together with appeals to that -authority. This is the idea of a claim in liberty. The following points are -obvious: -(a) No one can intelligibly make a claim against others simply on -the ground that the activities of those others restrict his primitive free- -dom, or that the extension of his primitive freedom requires action by -them. At best, that is the start of a quarrel, not a claim to its solution. -(b) Similarly, no sane person can expect that his primitive freedom -merely as such should be protected. -(c) Equally, suppose that someone uses the notion of a right: no sane -person can think that he has a right against others to what is demanded -by his primitive freedom as such (i.e., to anything he happens to want.) -(d) A similar point can be made in terms of the good: no one can -intelligibly think that it is good (period, as opposed to good for him) that -his primitive freedom should be unlimited.
-The effect of these points is that the resolution of questions of how far -a person’s freedom should be protected or extended, how far it is good -that it should be, how far he has a right that it should be, requires some -degree of impartiality (a general point of view, in Hume’s phrase) which -is not contained in the idea of an individual’s primitive freedom as such. -The importance of these points has been emphasized by Ronald -Dworkin.15 However, he assumes that a claim in liberty must be a claim -to a specific kind of right to do the thing in question, such as a right of -free speech. He concludes from this that there can be no conflicts be- -tween liberty, properly understood, and any rightful claim. For suppose -some other value, such as equality or more generally justice, when prop- -erly interpreted, requires that I not do a certain thing. Then I have no -right to do that thing. So I cannot correctly make a claim in liberty to do -it, and so, if I am prevented from doing it, this cannot be a restriction on -my liberty (though it is of course a restriction of primitive freedom.) -It cannot be necessary that this conclusion should follow from the -understanding of liberty. Indeed, in my view, it is necessary that it should -not follow. We are constructing liberty as a political value, which means -among other things that we can make sense of its role in political argu- -ment and political conflict, and generally of the experience of life under -a political order. It is one datum of that experience that people can even -recognize a restriction as rightful under some political value such as -equality or justice, and nevertheless regard it as a restriction on liberty. -The notion of a cost in liberty is at least as well entrenched in historical -and contemporary experience as that of a rightful claim in liberty.
-This notion of a cost in liberty can apply, I just suggested, even to -people who agree with some restrictive measure, introduced for instance -in the interests of equality—they can still regard it as a restriction on lib- -erty, though a justified one. Dworkin’s view cannot make sense of the -attitude of such people: on his view, they are merely confused. But the -point about a cost in liberty applies even more significantly to those who -do not agree that the cost is necessary. The state enacts, by quite proper -process, some measure in the name of equality, say, which restricts the -activities of some people. Those people oppose it, and let us suppose -that they oppose it on principle: they do not accept the ideal of equality, -or this application of it, or this way of going about it. They certainly re- -gard the measure as a restriction on their liberty. Dworkin’s view can in -its own terms give a coherent account of this reaction (they do not think -the measure is rightful), but it now raises a different question: suppose -we are supporters of the measure, what attitude should we take towards -the people who have this reaction, our political opponents? Since we think -that they are wrong in opposing the measure, specifically in denying that -the measure is justified in the name of equality, we must suppose, on -Dworkin’s view, that they are wrong in thinking that their liberty is being -restricted. They are coerced by the state, they resent it, they vividly think -that their liberty is curtailed. Dworkin patiently explains to them that -they are simply wrong in thinking this; they may think that there is a cost -in liberty, their liberty, but there is not. This is exactly the attitude that -Rousseau thought appropriate, and it seems to me just as objectionable -now as it was with him. -We should take seriously the idea that if, under certain conditions, -people think that there is a cost in liberty, then there is.
Taking that idea -seriously, I suggest, is a condition not only of taking seriously the idea of -political opposition, but of taking our political opponents themselves -seriously. -There is one class of complainants about costs in liberty whom, I think, -we need not take seriously: those who complain that their liberty, or in- -deed their primitive freedom, is curtailed by the mere existence of a state. -Certainly not their liberty: since liberty is freedom as a political value, no -complaint is a complaint in liberty if it would apply to any political sys- -tem or any state whatsoever, so the existence of the state is not itself an -offense against or limitation on liberty (though some particular forms of -the state may of course readily be so.) Moreover, this is not simply a ver- -bal point about the understanding of “liberty”; we need not agree, either, -that the fact that a person is subject to a state is, in itself, a limitation on -his primitive freedom. The reason for this is that the amount of freedom -that a person would have without the state is entirely indeterminate or, -at any rate, very small. Two conclusions follow about anarchism: from -the point about liberty it follows that it is not a political position, and from -the point about primitive freedom, that it is not interesting, and I hap- -pily accept both these conclusions. -VI. BEYOND CLAIMS IN LIBERTY -The Rousseau outlook (as we might call it) fails to make sense of an en- -tirely familiar reaction that is basic to politics and to the understanding -of political opposition. For that reason, it does not encourage a helpful— -one might say, healthy—relation to one’s opponents.
What we should take -seriously are their reactions, or at least their deeper reactions, rather than -the extent to which we are disposed to share or morally approve of their -reactions, and this applies in different forms whether they are opponents -outside our polity or opponents within it. There is a potentially instruc- -tional, potentially patronizing, element in the Rousseau outlook which, -to take just the case of local opponents, is hostile to the relations of fel- -low citizenship which we must hope can co-exist with political opposi- -tion — so long at least as we believe that there should be one polity and -political opposition has not irreparably divided it. Indeed, this moral- -ized outlook in some of its more spectacular historical expressions, such -as the Terror, has shown that it can destroy not just citizenship but citi- -zens. -The philosophical fault at the heart of this outlook might be said to be -this, that it bases the idea of liberty on that of a rightful claim in liberty. -The notion of a claim in liberty, I have said, is useful in distinguishing -liberty from primitive freedom in the first place. It can do this because -any adequate idea of liberty must at any rate accommodate the idea of a -claim in liberty, and the idea of primitive freedom, in itself, cannot do so -at all. But the idea of a rightful claim in liberty implies a juridical concep- -tion, of an agreed authority which can rightfully grant or refuse such a -claim, and political opponents do not necessarily understand their situ- -ation in these terms. As I put it earlier, they are not all interpreting the -same text. -In the case of opponents in different political systems, they may not -agree on terms in which such an authority, if they imagined it to exist, -might legitimate its decisions to them.
Between opponents who share a -polity and neither of whom wants to destroy it, they will agree on an au- -thority or process which decides what will happen, but this is not at all -equivalent to the authority’s deciding that one or another claim in lib- -erty is rightful. The reason for this lies in a characteristic of the political -that I mentioned before, that political disagreements are not identified -through the kinds of reasons that are deployed in them. The reasons for -which an agreed political authority decides what will happen are vari- -ous, and the decision in various ways may affect people’s liberty, but the -decision is not itself an announcement of what is a rightful claim in lib- -erty. -In the very special case of a polity that has an institution of judicial -review, executive and legislative decisions can be checked against claims -in liberty. In such a state, some political decisions, in the widest sense, -are judicial ones: i.e., the decision which decides what will happen is -made for judicial reasons. (This is not the same as the familiar charge, in -criticism of such an institution or of its operation, that some of these -judicial decisions are, in a narrower sense, political ones.) But even here -the sense that one’s liberty is restricted by a decision cannot be identi- -fied with the thought that the court, if it acted rightly, would grant or -would have granted or indeed should have granted one’s claim in liberty. -One may agree that the court, if it was doing its job properly, would not -have granted such a claim, but one can still feel that the decision re- -stricts or even violates one’s liberty. First, the court itself may accept that -its decision, though rightful, involves a cost in liberty.16 A more general -reason, however, is that judicial reasons, the kinds of reason that a con- -stitutional court, however inventive, must attend to, are only one kind of -reason.
(Even those such as Dworkin who think that judicial review should -include explicit and wide-ranging moral reasons accept that since these -are decisions within a given legal system, they are bound by other con- -straints, such as stare decisis.) So the person who feels his liberty injured -may feel this in virtue of other reasons, indeed other reasons of prin- -ciple, which he does not suppose would vindicate a claim of right in the -judicial forum. If he is angry at the outcome, then the focus of his anger -might be this, that things are such that the final court of appeal must -rightfully decide against him, and this thought might survive the under- -standing that given the legal history and the court’s situation there was -no realistic alternative to things being this way. -The thought that an action, say a political decision, involves a cost in -one’s liberty does not necessarily involve the thought that one would have -a rightful claim in liberty before some specified or indeed unspecified -authority. So what does go into the idea of a cost in liberty? We should -recall that we are trying to construct this idea as part of constructing an -idea of liberty itself which will serve our needs. The construction started -from certain experiences associated with perceived limitations on primi- -tive freedom. We should turn back to that again, and approach the con- -struction of the idea of a cost in liberty by considering what it is to feel -that something involves a cost in one’s liberty. -VII. RESENTMENT AND OTHER SUCH REACTIONS -When I considered in the first place the transition from primitive free- -dom to liberty, I said that the reaction to coercion in the most elemen- -tary case was resentment. But the experience of feeling that one’s liberty -is being restricted need not necessarily take the form of resentment.
How -far it can be expected to do so is not an easy question to pursue, because -resentment so readily merges into other negative feelings, such as anger -and dislike, not just for conceptual but also for various familiar psycho- -logical reasons. In relation to freedom, the primitive and purest case of -resentment is perhaps that in which another person acts manifestly and -effectively in a way that prevents me from doing what I want, and does -so with that intention, and I think, moreover, that there is nothing to be -said at all in favor of his doing so from any point of view except his. There -are of course many cases of resentment in which this strongest condi- -tion is not satisfied. I may think, for instance, that the action was in my -long-term interests, even that it was done with that intention, and still I -may resent it. (Of course there may be a problem in such a case of sort- -ing out what exactly it is that I resent—I may just resent, for instance, the -fact that he took for granted his own ideas about my interests.) -It is usually said that the particular reaction of resentment is tied to -the idea of the other person’s action being not rightful. If we accept this -idea, and also identify as (necessarily) resentment the feelings that go -with a sense of a restriction on one’s liberty, we shall be back on the road -to Rousseau’s outlook. But I think that we should loosen both these con- -nections. Resentment is not so closely tied to the idea of right,17 and a -sense of coercion or restricted liberty can be connected to reactions that -range more widely than resentment in the strictest sense. A helpful con- -sideration here is the extent to which the person whose liberty is in ques- -tion is identified with the actions that might be felt to restrict or violate -that liberty.
This idea helps us to explain the case of the citizen who thinks -that a certain political decision is both procedurally correct and right in -principle, but nevertheless experiences its consequences for himself as -a cost in liberty. The reason that this is possible is that his sense of him- -self is not entirely that of a person identified with the state’s decisions, -however rightful. Rousseau of course wanted each person in a virtuous -republic to be identified totally with himself or herself as citizen, but it is -inevitable and appropriate and an entirely good thing that on any con- -ception of a modern society—and I suspect also, on a realistic concep- -tion of any society whatsoever—this is not going to be so.18 -Someone who disapproves of a measure in principle but not on pro- -cedural grounds is less identified with it than someone who approves of -it in both these respects. Someone who finds it both procedurally and in -principle objectionable is even less identified with it, and one who thinks -that all the procedures are a sham is less identified still. At the end of this -line, when the action that constrains someone is experienced as nothing -but coercion, sheer force in the interests of others, the lack of identifica- -tion is total, and this certainly is resentment. But right from the begin- -ning of this progression there is room for the idea that the action, what- -ever there is to be said for it, is a limitation of someone’s liberty, to the -extent that he identifies with the desires and projects which this action -will frustrate. -… by an action projects on it the idea that it is not rightful. But then the idea of right must be -salient in those particular cases, precisely because the reaction is identified as a moralistic -rationalization. We can recognize resentment in less moralized circumstances: for instance, -where A bears a grudge against B because B beat him (fairly) in a contest. -It is not a necessary condition of there being a cost in someone’s lib- -erty or a restriction of it that he has such experiences of resentment, frus- -tration, or whatever. This takes us back to a point we noticed earlier in -this construction, in the example of the happy slaves. We deplore their -lack of liberty; they—we are fancifully supposing—do not. But if they do -not, is there anything, on the present line, on which we can build our -complaint? I suggested earlier that there is, in what I called the Critical -Theory principle. The slaves are subject to a regime which (simply as a -matter of fact) would pursue much the same objectives whatever they -desired. We are supposing that they do not experience any frustration, -although they are not allowed to satisfy some desires that human beings -in general might be expected to have (e.g., they cannot marry or travel or -stop work.) In actual fact, of course, it is very unlikely that they will not -feel frustrated in these respects, which is what makes this a rather objec- -tionable fantasy, but suppose it to be so. In addition, they do not have -certain other desires or aspirations which others have in those historical -circumstances, such as a desire for political representation. In both re- -spects, the state of their desires is identifiably a product of that regime, a -regime, moreover, which would not be responsive even if they had the -desires in question. In those circumstances, the absence of the desires -does not refute the complaint in liberty, once it is made; if anything, it -gives it extra force. It is the Critical Theory principle that explains, I think, -why a complaint in liberty is not turned away in such a situation, and -hence why the presence of frustrated desire is not a necessary condition -of a cost in liberty.19 -VIII. LIBERTY NOW -Let us try to assemble some conditions on liberty.
We may recall -(i) A practice is not a violation of liberty if it is necessarily involved -in there being a state at all. -However, -(ii) The principle of (i) cannot be relativized to a particular state or -polity, since particular states or polities can obviously be criticized for -violations or undue restrictions of liberty. At the same time, there is -limited interest in comparing all existing states to some ideal model -of a state. In particular, what desires or frustrations people might have -under increasingly counterfactual conditions is increasingly indeter- -minate. Utopian political discourse is of course possible and may have -its uses, but it is at best obliquely related to arguments about the lib- -erty we can hope to find in our world. This is not to say that Utopian -discourses about liberty are analytically or definitionally incoherent. -In terms of the broadest construction of liberty, we can find a place -for some of them, if they are not otherwise too incoherent. But they, -and the comparisons they invite with the actual, do not do much for -the more specific construction of liberty as a value for us. -In pursuing that construction, it seems to me that we should restrict the -Utopia factor by accepting in particular that -(iii) Modernity is a basic category of social and hence of political un- -derstanding, and so a politically useful construction of liberty for us -should take the most general conditions of modernity as given. This -was the lesson of Benjamin Constant’s marvelous speech, given in 1819, -The Liberty of the Ancients compared with that of the Moderns,20 in -which he pointed out that whatever the merits for an ancient republic -of a concept of liberty linked to republican virtue, they were essen- -tially limited to the conditions of an ancient republic, and only disas- -ter could follow, as indeed it had followed in France, from trying to -apply such an ideal to a modern commercial society.
-Of course there is room for much argument about what the condi- -tions of modernity are, what forms a modern society can intelligibly take, -and so on: but that is as it should be, for that is the substance of much -significant political argument. But granted in a general sense the condi- -tions of modernity as shaping the construction of our idea of liberty, there -will be a variety of consequences. For instance, I mentioned earlier a -range of things that can count as coercive restrictions on an agent’s do- -ing what he wants, intentional activities of others that can count as lim- -iting freedom. In the context of modernity, it will be clear why in general -factor (C) above, the effects of competition in something like a zero-sum -game, will not count, because competition is integral to the social sys- -tem. -This is not to deny that there can be political arguments to the effect -that certain kinds of competition are so damaging to the general inter- -est, and perhaps to the interests of losers, that they need to be controlled: -it is merely that these are not per se arguments based on the losers’ lib- -erty. Rather similarly, factor (D) above—by-products of another enter- -prise not aimed at the person in question—do not presumptively count -as limiting that person’s liberty, though there are many special cases in -which they do so. This is because they are a ubiquitous phenomenon -essentially connected with the society’s central activities. Factor (E), on -the other hand, arrangements which structurally limit the opportunities -of some class of citizens, are more likely to count, and complaints about -power structures which have such effects are readily understood as com- -plaints in liberty. This is because we have a better and typically modern -understanding of such power structures, and, we hope, some achievable -means of changing the situation.
-Granted that a person’s complaint that he has sustained a cost of lib- -erty lies within such limits implicit in the conditions of modernity, how- -ever exactly we understand them; granted the wider condition (i), that -the restriction is not one that would be necessary under any state; and -granted of course that it is factually correct, that is to say that his desires -really are frustrated or limited by the activities about which he is com- -plaining; then we should accept the idea that emerged from the earlier -arguments, that if someone feels that some action or arrangement im- -poses on him a cost in liberty, then it does indeed do so. This does not -mean, of course, that the action or arrangement should not be allowed: -the cost in his liberty is very often outweighed by the values served by -the action or arrangement. Moreover, it need not justify or call for any -compensation. He need not have a claim in liberty in any court. But a -cost in liberty is still what it is, even if he quite properly has to carry the -cost himself. -A construction of liberty on these lines might be thought to spread -the idea of a cost in liberty too wide. It means that, within certain limits, -anyone with a grievance or who is frustrated by others’ actions can ap- -propriately complain about restrictions on his liberty. If “appropriately” -means that it is semantically, conceptually, indeed psychologically, in- -telligible that he should do so, that is right. If it means that it is necessar- -ily useful, helpful, to be taken seriously as a contribution to political de- -bate, and not a waste of everyone’s time, it is not right. The point is that -these latter considerations are in the broadest sense political consider- -ations, and that is the point of the construction.
-The conditions I have suggested for complaints of the loss of liberty -might be expressed in terms of “realism.” A form of liberty that could not -be offered by any state is an entirely unrealistic basis of objection, and -the limitation to the conditions of modernity implies a further step to- -wards a realistic political position or claim, which can be taken seriously. -It may be said that there are two different questions here, which this ap- -proach runs together: whether it is true that someone has sustained a -cost in liberty and whether it is sensible, useful, reasonable, or sane to -complain about it. These ideas are indeed not the same. It is not a reason -for supposing that there has been no loss of liberty, that it is not politi- -cally prudent to say that there has been: the loss of liberty lies in the -good sense attached to the resentment, not in the good sense or other- -wise of expressing it. However, what it is reasonable to count as some- -thing that it is sensible for someone to resent is a matter of one’s overall -view of the political world, and so, while the two ideas are certainly dis- -tinct, there is an extensive area in which they overlap, and a properly -political conception of liberty acknowledges this. Resentment about the -loss of liberty, like resentment about anything else, implies the thought -of an alternative world in which that loss does not occur, and just be- -cause liberty is a political value, the distance of that possible world from -the actual world must be measured in terms of political considerations -of relevance and practical intelligibility. The world of the anarchists is -too far away—too far away from anything—to ground complaints in lib- -erty at all. Many complaints that fly in the face of modernity equally do -not even cross the threshold of offering a serious political consideration.
-It is also true, of course, that even if “Utopian politics” is a contradic- -tion in terms, “Utopian political thought” is not, and someone may make -a case for taking seriously complaints in liberty that would not get a hear- -ing in everyday political activity. He may show that some dimension of -resentment is more sensible than conventional opinion supposes; or he -may, just as effectively or more so, claim that whether it is what people -call “sensible” is not the point. The aim, he may rightly say, is to change -the world, and his elevation of his or others’ resentment into a complaint -about liberty may indeed succeed in making it into a complaint about -liberty. -What we should be arguing about with such a complainant, if it is worth -arguing with him at all, is whether it is in the least sensible for him to -expect that a desire of that kind should not be frustrated; whether his -conception of a social world in which it would not be frustrated is not a -fantasy, either in general or in relation to historical circumstances in -which he necessarily finds himself; whether, on reflection, he does not -identify more deeply with the considerations that justify the coercion -than with his original desire. These are the materials of political persua- -sion, in the broadest sense, and this is what we should be engaged in. A -major aim of constructing liberty in the way I have suggested is that it -should leave space in which these arguments can take place. -There is a further and benevolent consequence. He may indeed per- -suade us: our sense of what is “realistic” will change, and with it, the di- -mensions of liberty. But if, on the other hand, our persuasions succeed, -he will cease to feel the frustration. His resentment will go away.
He may -come to identify fully with the grounds of coercion in such a case; he -may cease to desire what he originally desired; in any case he will not -care any more that he cannot have what he desires. If this happens, then, -on the construction I am oVering, there will be no frustrated desire (and -not for reasons that fail the Critical Theory test); so his liberty will no -longer be restricted, and there will no longer be a cost in liberty. -IX. THE VALUE OF LIBERTY -Someone may ask why liberty is a value at all. This might mean, why is -liberty in any of the various constructions that have been given of it in -diVerent historical circumstances a value at all? Why should human be- -ings in general be concerned with some value of that form? I do not know -that I can answer that question, beyond suggesting a set of questions to -put in its place: What view would one have to take of one’s desires and -projects and other values if there were never even a question of its being -something to be resented and resisted if others aimed to frustrate them? -What view would one have to take of those others, in particular of a po- -litical authority, for that question never to arise? -A better question might be: why is liberty the special value it is for us? -Why does it play the particular role that it does in our political thought -Philosophy & Public AVairs -24 -and aspirations? In particular, why is it so important? That question must -be directed to liberty under the kind of construction that is appropriate -to our circumstances, and one answer to it, an “internal” answer, will lie -in inviting the questioner to think about liberty in terms of those cir- -cumstances and in relation to other political values and beliefs that be- -long to our world. 
We invite him to acknowledge who and where he is, -and ask him what alternative he has to this structure of ideas and at what -Utopian distance the alternative, and the political arrangements that -might go with it, lie from the world in which we and he all live. We can -argue about the merits of those other arrangements, and this will be, -once more, a political argument, one that works with the materials which, -in this condition, he and we can use. -This is Wne, so far as it goes. Yet there is something unsatisfactory about -saying just this much. On the one hand, we are insisting that if we are to -think realistically about political values, we must do so, so to speak, from -here. At the same time, indeed in making this very statement, we seem to -acknowledge that “here” is just one place among others: that we can con- -sider the modern condition, our condition, to some extent from the out- -side and compare it with others. If we can do that, then we should be -able to say rather more than we have said about this modern construc- -tion of liberty, and its value, as compared with others. This touches on a -familiar point which I mentioned very brieXy before. One of the most -prominent characteristics of modernity is its historical self-conscious- -ness, and that carries with it certain demands on how we understand -ourselves. What we have said to this questioner so far does not seem to -do enough to meet those demands. Can we do any more? -Perhaps we can. In conclusion, I shall try to sketch in the barest out- -line some more that we might say. To do so, I must go back for a last time -to primitive freedom and its being, as I put it, a “proto-political” con- -cept. I argued that primitive freedom is not itself a political value (and -perhaps not a value of any kind). 
This is because the notion of a political -value implies an impartial standpoint to determine the priority of diVer- -ent agents’ desires, a standpoint which is not given simply by the idea of -each person’s desires. That standpoint must be that of an authority with -a power to enforce. Once we have such an authority, I said, the question -of freedom and coercion arises again, now in relation to the coercion -which the authority exerts. If this is not to be merely another contribu- -tion to conXict, the authority must have authority; and this means that -From Freedom to Liberty: -25 -The Construction of a -Political Value -in some terms or other, it must be acknowledged as legitimate. Let us -now say there is need for legitimate government (where this means that -it is counted or recognized as legitimate in a given society, not that we -would necessarily accept it by our standards of legitimacy). -I take it that the following is a universal truth: legitimate government -is not just coercive power. It is true, moreover, in the sense of “legiti- -macy” I am using, in which the idea is relativized to local understand- -ings: everyone everywhere where there is such a thing as government -recognizes some distinction between legitimate government and a mere -conspiracy of eVective coercion, even if many people have lived and do -live under such a conspiracy or in a state which is not much more. For -there to be legitimate government, there must be a legitimation story, -which explains why state power can be used to coerce some people rather -than others and to allow people to restrict other people’s freedom in some -ways rather than others. 
Moreover, this story is supposed to legitimate -the arrangements to each citizen, that is to say, to each person from whom -the state expects allegiance; though there may be other people within -the state, slaves or captives, who are nakedly the objects of coercion and -for whom there is no such legitimation story.21 -The fact that everywhere there is a legitimation story to be told to each -citizen does not imply, of course, that in terms of the story there is some -presumption that citizens should be treated equally. Most such stories -in the past have delivered various forms of inequality and hierarchy, with -corresponding constraints on the activities of some citizens in relation -to other citizens and to the state itself. The fact that there is a legitima- -tion story to be told is indeed enough to distinguish these societies as -examples of legitimate government, in contrast to mere successful ex- -amples of banditry. The signiWcant point for us, however, and for our -construction of liberty and the value we attach to it, is that we do not -believe these stories, and it is a notable feature of modernity that we do -not. I do not mean merely that we do not accept the stories as legitimat- -ing stories for us. I mean that to a considerable degree we regard the -Philosophy & Public AVairs -26 -content of these stories, in particular those that involve religious or other -transcendental justiWcations, as simply untrue. It follows—or would fol- -low with much further argument—that in telling our own legitimation -story we start, in a sense, with less. In interpreting and distributing lib- -erty we allow each citizen a stronger presumption in favor of what he or -she certainly wants, to carry out his or her own desires. -Of course the presumptions in favor of equal and extensive liberty in -modern societies are intimately connected with the central activities of -those societies, in particular their forms of economic organization. 
This -is an historical platitude, but by itself it will not help our questioner who -wanted to hear more of why we value liberty as we do. Something on the -lines of the absurdly rough sketch I just outlined can perhaps give him -more. The sketch indeed connects our construction of liberty, and the -value we give it under that construction, with the condition of moder- -nity, but it oVers more than the consideration (which is in itself a per- -fectly sound consideration) that this is our condition. It connects our -ideas of liberty with a universal truth, that everywhere legitimacy requires -more than mere coercion, and it adds to this the conviction that under -the conditions of modernity, whatever else may be worse, we at any rate -have a better grasp on the truth. I do not mean on the truth about lib- -erty—in relation to this questioner, that would be marching on the spot. -Rather, we have a grasp on truths that destroy those fantasies that once -provided the fabric of pre-modern legitimation stories. -If that account could be made good, it would yield the conclusion that -modern societies, or some of them, are rightly more concerned with lib- -erty and aim to deliver more of it than did earlier societies. Of course, the -liberty they aim to deliver is understood or constructed in terms appro- -priate to modernity, but that does not make their promise merely circu- -lar or empty. It is backed by the idea that whatever else they may have -taken away or made impossible, modern societies can oVer and perhaps -sustain a construction of liberty in which the constraints on it are fewer -and, above all, more truthfully motivated than in most societies of the -past. - - -References - diff --git a/bin/test.cite b/bin/test.cite deleted file mode 100644 index 1cf0938..0000000 --- a/bin/test.cite +++ /dev/null @@ -1,78 +0,0 @@ -1. John Rawls has said in Political Liberalism (New York: Columbia University Press, 1993), -p. 
xvi, “In [A] Theory [of Justice] a moral conception of justice general in scope is not distin- -guished from a strictly political theory of justice,” and the aim of the later book is to give -such a political theory. But the later account still represents the political conception as -itself a moral conception, although one directed to a special subject matter (p. 11). It is -signiWcant how far moral conceptions still structure the theory: the solution to the central -problem of the stability of a just society, for instance, is worked out in terms of the moral -powers of its citizens. -2. The somewhat Manichean distinction between “principle” and “policy,” where the -latter is understood in consequentialist terms, is sometimes understood as roughly paral- -lel to that in the United States between the Supreme Court and the Congress. To the ex- -tremely limited extent that this is true, it can be regarded as a special product of history as -well as something of a misfortune. -3. Carl Schmitt, Das BegriV des Politischen translated as The Concept of the Political -(Chicago: University of Chicago Press, 1996). -4. Reply to the Second Set of Objections to the Meditations: The Philosophical Writings -of Descartes, vol. 2, translated by John Cottingham (Cambridge: Cambridge University Press, -1984), p. 94. -5. John Locke, Essay on Human Understanding, ii.1.56. -6. Isaiah Berlin, “Two Concepts of Liberty” (1958), reprinted in Four Essays on Liberty -(Oxford: Oxford University Press, 1969). -7. On the distinction between negative and positive freedom, see Gerald C. MacCallum, -Jr., “Negative and Positive Freedom,” Philosophical Review 76 (1967); John Rawls, A Theory -of Justice (Oxford: Clarendon Press, 1972), sec. 32. -8. Hobbes famously argued that such things do not reduce freedom, but merely raise -the cost of a particular course of action. 
Although it suited Hobbes’s purpose to treat this as -a consideration relevant to the theory of political freedom, it is better understood in the -context of an account of voluntary action: the fact that an action is coerced in this sense -does not mean, standardly, that it fails to be a fully intentional action. -9. As is argued by Raymond Geuss in History and Illusion in Politics (Cambridge: Cam- -bridge University Press, 2001), pp. 96–98. -10. Quentin Skinner (“The Paradoxes of Political Liberty,” in S. M. McMurrin, ed., Tan- -ner Lectures on Human Values VII [Salt Lake City: University of Utah Press, 1986]) points -out that this is not a paradox in the context of positive liberty theory. Indeed. But since it -is a paradox, that is a problem for the theory. -11. More irresponsibly than the tradition of republican liberty, which, as Skinner has -shown (“The Paradoxes of Political Liberty”), is something diVerent. It is not surprising, -however, that it should be suspect for some of the same reasons: see note 18. -12. Fragment B23, in Herman Diels and Walther Kranz, Die Fragmente der Vorsokratiker, -6th ed. (Berlin: Weidmann, 1951–52) -13. Geuss (History and Illusion in Politics) refers to this remark, p. 104, 108–9, but he -does not discuss it in relation to the argument mentioned above at note 9. -14. The following arguments suggest that it is not a value of any kind, but I shall not take -up that question here. -15. Ronald Dworkin, Sovereign Virtue (Cambridge, Mass: Harvard University Press, 2000), -ch. 3. It is fair to say that Dworkin’s disinclination to accept conXicts between liberty and -equality depends as much on his account of equality as on his account of liberty. I am -grateful to Dworkin for many discussions of this subject, which have done much to shape -the present discussion. -16. The U.S. 
Supreme Court itself implicitly accepts this when it engages in “balancing.” -An illustration is the “undue burden” test for the constitutionality of regulations on abor- -tion: Planned Parenthood v. Casey, 505 US 833 (1992). (I am indebted here and elsewhere to -Robert Post.) -17. The idea that resentment is grounded in thoughts about right is encouraged by the -familiar phenomenon of back-formation, in which someone who is merely disadvantaged -18. Here Rousseau’s outlook coincides with the tradition of republican virtue (see note 11 -above). The idea that in a virtuous ancient republic the constraint to engage in public ser- -vice did not involve a cost in liberty, if it implies anything about citizens’ actual reactions, -should surely be treated with some skepticism. If it says, rather, that because an ideally -rational citizen would not react in that way, those reactions do not count, republican lib- -erty will certainly court many of the same dangers as “positive liberty.” -19. It is not suggested that this is a suVicient account of a Critical Theory test. Obviously, -beliefs and states of desire can be quite properly the causal product of regimes to which -people have been exposed or even subjected: educational regimes, for instance. Further -questions are involved: partly, about the kinds of belief in question, and what they, or the -presence or absence of certain desires, are supposed to justify; partly, about the attitude -that the people would have to the beliefs or desires if they knew how they came about. I -discuss some of the problems involved in Telling and Truthfulness (Princeton: Princeton -University Press, forthcoming.) -20. See Benjamin Constant, Political Writings, ed. Biancamaria Fontana (Cambridge: -Cambridge University Press, 1988), p. 309 V. Cf in these connections “St Just’s Illusion,” in -my Making Sense of Humanity (Cambridge: Cambridge University Press, 1995). -21. 
I have claimed in Shame and Necessity (Berkeley: University of California Press, 1993), -ch. 5, that this was the situation with slavery in the ancient world, which was typically re- -garded as necessary rather than just: the Helots in Sparta were indeed explicitly under- -stood to be enemies in captivity. The racist justiWcations of modern slavery were presum- -ably meant in some sense to legitimate the institution; I am less clear how far they were -meant to legitimate it to the slaves. \ No newline at end of file diff --git a/bin/test.out b/bin/test.out deleted file mode 100644 index 982a31b..0000000 --- a/bin/test.out +++ /dev/null @@ -1,181 +0,0 @@ - - - - -From Freedom to Liberty: The Construction of a Political Value Williams, Bernard Arthur Owen. Philosophy & Public Affairs, Volume 30, Number 1, Winter 2001, pp. 3-26 (Article) Published by Princeton University Press DOI: 10.1353/pap.2001.0015 For additional information about this article -http://muse.jhu.edu/journals/pap/summary/v030/30.1williams.html -Access Provided by Cambridge University Library at 06/03/10 11:08AM GMT From Freedom to Liberty -BERNARD WILLIAMS -The Construction of a -Political Value - - - - - - -John - -Rawls has said in Political Liberalism (New York -1993 -Columbia University Press -1. -John Rawls has said in Political Liberalism (New York: Columbia University Press, 1993), p. xvi, “In [A] Theory [of Justice] a moral conception of justice general in scope is not distinguished from a strictly political theory of justice,” and the aim of the later book is to give such a political theory. But the later account still represents the political conception as itself a moral conception, although one directed to a special subject matter (p. 11). It is signiWcant how far moral conceptions still structure the theory: the solution to the central problem of the stability of a just society, for instance, is worked out in terms of the moral powers of its citizens. 
- - -The somewhat Manichean distinction between “principle” and “policy,” where the latter is understood in consequentialist terms, is sometimes understood as roughly parallel to that in the United States between the Supreme Court and the Congress. To the extremely limited extent that this is true, it can be regarded as a special product of history as well as something of a misfortune -2. -The somewhat Manichean distinction between “principle” and “policy,” where the latter is understood in consequentialist terms, is sometimes understood as roughly parallel to that in the United States between the Supreme Court and the Congress. To the extremely limited extent that this is true, it can be regarded as a special product of history as well as something of a misfortune. - - - -Carl Schmitt - -Das BegriV des Politischen translated as The Concept of the Political (Chicago -1996 -University of Chicago Press -3. -Carl Schmitt, Das BegriV des Politischen translated as The Concept of the Political (Chicago: University of Chicago Press, 1996). - - -Reply to the Second Set of Objections to the Meditations: The Philosophical Writings of Descartes -1984 -2 -94 -Cambridge University Press -4. -Reply to the Second Set of Objections to the Meditations: The Philosophical Writings of Descartes, vol. 2, translated by John Cottingham (Cambridge: Cambridge University Press, 1984), p. 94. - - - -John Locke - -Essay on Human Understanding - -5. -John Locke, Essay on Human Understanding, ii.1.56. - - - -Isaiah Berlin - -Two Concepts of Liberty -1958 -Oxford University Press -6. -Isaiah Berlin, “Two Concepts of Liberty” (1958), reprinted in Four Essays on Liberty (Oxford: Oxford University Press, 1969). - - - -Gerald C MacCallum - -On the distinction between negative and positive freedom, see -1967 -Philosophical Review -76 -32 -Clarendon Press -7. -On the distinction between negative and positive freedom, see Gerald C. 
MacCallum, Jr., “Negative and Positive Freedom,” Philosophical Review 76 (1967); John Rawls, A Theory of Justice (Oxford: Clarendon Press, 1972), sec. 32. - - -Hobbes famously argued that such things do not reduce freedom, but merely raise the cost of a particular course of action. Although it suited Hobbes’s purpose to treat this as a consideration relevant to the theory of political freedom, it is better understood in the context of an account of voluntary action: the fact that an action is coerced in this sense does not mean, standardly, that it fails to be a fully intentional action -8. -Hobbes famously argued that such things do not reduce freedom, but merely raise the cost of a particular course of action. Although it suited Hobbes’s purpose to treat this as a consideration relevant to the theory of political freedom, it is better understood in the context of an account of voluntary action: the fact that an action is coerced in this sense does not mean, standardly, that it fails to be a fully intentional action. - - -As is argued by Raymond Geuss in History and Illusion in Politics (Cambridge -2001 -96--98 -Cambridge University Press -9. -As is argued by Raymond Geuss in History and Illusion in Politics (Cambridge: Cambridge University Press, 2001), pp. 96–98. - - -The Paradoxes of Political Liberty -1986 -Quentin Skinner -University of Utah Press -10. -Quentin Skinner (“The Paradoxes of Political Liberty,” in S. M. McMurrin, ed., Tanner Lectures on Human Values VII [Salt Lake City: University of Utah Press, 1986]) points out that this is not a paradox in the context of positive liberty theory. Indeed. But since it is a paradox, that is a problem for the theory. - - -More irresponsibly than the tradition of republican liberty, which, as Skinner has shown (“The Paradoxes of Political Liberty”), is something diVerent. It is not surprising, however, that it should be suspect for some of the same reasons: see note 18 -11. 
-More irresponsibly than the tradition of republican liberty, which, as Skinner has shown (“The Paradoxes of Political Liberty”), is something diVerent. It is not surprising, however, that it should be suspect for some of the same reasons: see note 18. - - -Fragment B23, in Herman Diels and Walther Kranz, Die Fragmente der Vorsokratiker, 6th ed -(Berlin: Weidmann, 1951–52) -12. -Fragment B23, in Herman Diels and Walther Kranz, Die Fragmente der Vorsokratiker, 6th ed. (Berlin: Weidmann, 1951–52) - - -Geuss (History and Illusion in Politics) refers to this remark, p. 104, 108–9, but he does not discuss it in relation to the argument mentioned above at note 9 -13. -Geuss (History and Illusion in Politics) refers to this remark, p. 104, 108–9, but he does not discuss it in relation to the argument mentioned above at note 9. - - -The following arguments suggest that it is not a value of any kind, but I shall not take up that question here -14. -The following arguments suggest that it is not a value of any kind, but I shall not take up that question here. - - - -Ronald Dworkin - -Sovereign Virtue -2000 -Harvard University Press -Cambridge, Mass -15. -Ronald Dworkin, Sovereign Virtue (Cambridge, Mass: Harvard University Press, 2000), ch. 3. It is fair to say that Dworkin’s disinclination to accept conXicts between liberty and equality depends as much on his account of equality as on his account of liberty. I am grateful to Dworkin for many discussions of this subject, which have done much to shape the present discussion. - - - -U S The - -Supreme Court itself implicitly accepts this when it engages in “balancing.” An illustration is the “undue burden” test for the constitutionality of regulations on abortion: Planned Parenthood v -1992 -Casey, 505 US -833 -16. -The U.S. Supreme Court itself implicitly accepts this when it engages in “balancing.” An illustration is the “undue burden” test for the constitutionality of regulations on abortion: Planned Parenthood v. 
Casey, 505 US 833 (1992). (I am indebted here and elsewhere to Robert Post.) - - -The idea that resentment is grounded in thoughts about right is encouraged by the familiar phenomenon of back-formation, in which someone who is merely disadvantaged -17. -The idea that resentment is grounded in thoughts about right is encouraged by the familiar phenomenon of back-formation, in which someone who is merely disadvantaged - - -Here Rousseau’s outlook coincides with the tradition of republican virtue (see note 11 above). The idea that in a virtuous ancient republic the constraint to engage in public service did not involve a cost in liberty, if it implies anything about citizens’ actual reactions, should surely be treated with some skepticism. If it says, rather, that because an ideally rational citizen would not react in that way, those reactions do not count, republican liberty will certainly court many of the same dangers as “positive liberty -18. -Here Rousseau’s outlook coincides with the tradition of republican virtue (see note 11 above). The idea that in a virtuous ancient republic the constraint to engage in public service did not involve a cost in liberty, if it implies anything about citizens’ actual reactions, should surely be treated with some skepticism. If it says, rather, that because an ideally rational citizen would not react in that way, those reactions do not count, republican liberty will certainly court many of the same dangers as “positive liberty.” - - -It is not suggested that this is a suVicient account of a Critical Theory test. Obviously, beliefs and states of desire can be quite properly the causal product of regimes to which people have been exposed or even subjected: educational regimes, for instance. 
Further questions are involved: partly, about the kinds of belief in question, and what they, or the presence or absence of certain desires, are supposed to justify; partly, about the attitude that the people would have to the beliefs or desires if they knew how they came about. I discuss some of the problems involved in Telling and Truthfulness (Princeton -Princeton University Press, forthcoming -19. -It is not suggested that this is a suVicient account of a Critical Theory test. Obviously, beliefs and states of desire can be quite properly the causal product of regimes to which people have been exposed or even subjected: educational regimes, for instance. Further questions are involved: partly, about the kinds of belief in question, and what they, or the presence or absence of certain desires, are supposed to justify; partly, about the attitude that the people would have to the beliefs or desires if they knew how they came about. I discuss some of the problems involved in Telling and Truthfulness (Princeton: Princeton University Press, forthcoming.) - - - -See Benjamin Constant - -Political Writings, ed. Biancamaria Fontana (Cambridge -1988 -309 V. Cf in these connections “St Just’s Illusion,” in my Making Sense of Humanity (Cambridge -p. -Cambridge University Press -20. -See Benjamin Constant, Political Writings, ed. Biancamaria Fontana (Cambridge: Cambridge University Press, 1988), p. 309 V. Cf in these connections “St Just’s Illusion,” in my Making Sense of Humanity (Cambridge: Cambridge University Press, 1995). - - -I have claimed in Shame and Necessity (Berkeley: University of California Press -1993 -21. -I have claimed in Shame and Necessity (Berkeley: University of California Press, 1993), ch. 5, that this was the situation with slavery in the ancient world, which was typically regarded as necessary rather than just: the Helots in Sparta were indeed explicitly understood to be enemies in captivity. 
The racist justiWcations of modern slavery were presumably meant in some sense to legitimate the institution; I am less clear how far they were meant to legitimate it to the slaves. - - - - \ No newline at end of file diff --git a/bin/test.txt b/bin/test.txt deleted file mode 100644 index 213e361..0000000 --- a/bin/test.txt +++ /dev/null @@ -1,965 +0,0 @@ -From Freedom to Liberty: The Construction of a Political Value -Williams, Bernard Arthur Owen. -Philosophy & Public Affairs, Volume 30, Number 1, Winter 2001, pp. 3-26 (Article) -Published by Princeton University Press -DOI: 10.1353/pap.2001.0015 -For additional information about this article -http://muse.jhu.edu/journals/pap/summary/v030/30.1williams.html -Access Provided by Cambridge University Library at 06/03/10 11:08AM GMT -From Freedom to Liberty: -BERNARD WILLIAMS -The Construction of a -Political Value -I. INTRODUCTION -My subject is freedom and in particular freedom as a political value. Many -discussions of this topic consist of trying to deWne the idea of freedom, -or various ideas of freedom. I do not think that we should be interested -in deWnitions. I leave aside the very general philosophical point that if -we mean, seriously, deWnitions, there are no very interesting deWnitions -of anything. There is a more particular reason. In the case of ethical and -political ideas, what puzzles and concerns us is the understanding of -those ideas—in the present case, freedom—as a value for us in our world. -I do not mean that we are interested in it only as it Wgures in precisely -our set of values—meaning by that, those of a liberal democratic society. -Manifestly it is equally part of our world that such ideas are also used by -those who do not share our values or only partly share them—those with -whom we are in confrontation, discussion, negotiation, or competition, -with whom in general we share the world. Indeed, we will disagree among -ourselves about freedom within our own society. 
We experience conXicts -between freedom and other values, and—a point I shall emphasize—we -understand some desirable measures as involving a cost in freedom. -Whatever our various relations may be with others in our world who -do or do not share our conception of freedom, we will not understand -our own speciWc relations to that value unless we understand what we -want that value to do for us—what we, now, need it to be in shaping our -An earlier version of this paper was given as the Dewey Lecture at the University of Chi- -cago Law School, April 2001. -© 2001 by Princeton University Press. Philosophy & Public Affairs 30, no. 1 -Philosophy & Public AVairs -4 -own institutions and practices, in disagreeing with those who want to -shape them diVerently, and in understanding and trying to co-exist with -those who live under other institutions. -In all their occurrences, these various conceptions or understandings -of freedom, including the ones we immediately need for ourselves, in- -volve a complex historical deposit, and we will not understand them -unless we grasp something of that deposit, of what the idea of freedom, -in these various connections, has become. This contingent historical -deposit, which makes freedom what it now is, cannot be contained in or -anticipated by anything that could be called a deWnition. It is the same -here as it is with other values: philosophy, or as we might say a priori -anthropology, can construct a core or skeleton or basic structure for the -value, but both what it has variously become, and what we now need it -to be, must be a function of actual history. In the light of this, we can say -that our aim is not to deWne but to construct a conception of freedom. -I shall not attempt a general account of what might count as construct- -ing one or another conception of freedom. One might say that the no- -tion of construction applies at diVerent levels. 
We need to construct a -value of freedom speciWcally for us; and we need a more generic con- -struction or plan of freedom which helps us to place other conceptions -of it in a philosophical and historical space—which shows us, one might -say, how other speciWc conceptions might be constructed in their own -right. Some of the questions raised by these requirements would simply -be a matter of terminology, of how we might use the term ‘construction.’ -But there is a more signiWcant consideration which links these two lev- -els. The conception of freedom we need for ourselves is both historically -self-conscious and suitable to a modern society—and those two features -are of course related to one another. Because of this, our own speciWc -and active conception of freedom, the one we need for our practical pur- -poses, will contain implicitly the materials for a reXective understand- -ing of the more general possibilities of construction. -However, it is just as important that the disputes that have circled -around the various deWnitions and concepts of liberty do not just repre- -sent a set of verbal misunderstandings. They have been disagreements -about something. There is even a sense in which they have been dis- -agreements about some one thing. There must be a core, or a primitive -conception, perhaps some universal or widely spread human experience, -to which these various conceptions relate. This does not provide, as it -From Freedom to Liberty: -5 -The Construction of a -Political Value -were, the ultimate deWnition. Indeed, this core or primitive item, I am -going to suggest, is certainly not a political value, and perhaps not a value -at all. But it can, and must, explain how these various accounts of the -value of freedom are elaborations of the same thing, that these various -interpretations are not just talking past each other. 
-There is another consideration which the familiar philosophical disputes and attempts at definition indeed take for granted, but they do not -give the right weight to it. In the sense that concerns these discussions, -freedom is a political value. (They are not addressing, for instance, metaphysical questions about the freedom of the will.) I am going to suggest -that this point itself, when it is properly understood, has a very significant effect on the kind of construction we should be trying to achieve. In -particular, we must take seriously the point that because it is a political -value, the most important disagreements that surround it are political -disagreements. What kinds or registers of politics are involved, what the -relevant understanding of politics will be, will depend on which disagreements are at issue—those within our own society, for instance, or those -with other societies. But our overall construction of freedom as a political value must allow the fact that it is a political value to be central and -intelligible. -I am certainly not going to offer a definition or any general characterization of the political. That would once again be impossible. But it may -be helpful to mention now four things I believe to be true about the political, which will shape the discussion and affect my overall construction of freedom as a political value. -(a) First, a point about philosophy: political philosophy is not just applied moral philosophy, which is what in our culture it is often taken to -be.1 Nor is it just a branch of legal philosophy, a point that will concern us -later. In particular, political philosophy must use distinctively political -concepts, such as power, and its normative relative, legitimation. -(b) The idea of the political is to an important degree focused in the -idea of political disagreement; and political disagreement is significantly -different from moral disagreement.
Moral disagreement is characterized -by a class of considerations, by the kinds of reasons that are brought to -bear on a decision. Political disagreement is identified by a field of application—eventually, about what should be done under political authority, in particular through the deployment of state power. The reasons that -go into political decisions and arguments that bear on them may be of -very various kinds. Because of this, political disagreement is not merely -moral disagreement, and it need not necessarily involve it, though it may -do so; equally, it need not necessarily be a disagreement simply of interests, though of course it may be. -(c) Possible political disagreements include disagreements about the -interpretation of political values, such as freedom, equality, or justice. -These disagreements may involve many different kinds of understanding and political traditions; they can tap into various areas of what I called -the historical deposit. It follows that the relation of these values to each -other cannot be established on the model of interpreting a constitution, -where questions typically take the form of determining what counts as, -say, limiting the freedom of speech. Of course, there is such an activity, -and it plays an important part in some cultures, such as that of the United -States. But even in those cases, it would be a mistake to equate political -thought about questions of principle with thought about actual or ideal -constitutional interpretation.2 We and our political opponents—even our -opponents in one polity, let alone those in others—are not just trying to -read one text. This will be an important point in what follows.
-(d) The last of these preliminary signals is provided by that word “opponents.” Carl Schmitt famously said that the fundamental political relation was that of friend and enemy.3 This is an ambiguous remark, and -it can take on a rather sinister tone granted the history of Schmitt’s own -relations to the Weimar Republic and eventually to the Third Reich. But -it is basically true in at least this sense, that political difference is of the -essence of politics, and political difference is a relation of political opposition, rather than, in itself, a relation of intellectual or interpretative disagreement. Many things can be covered by the idea of “opposition” itself. But they all bring with them the question of how we understand our -opponents, how far our opposition is a matter of interests, how far a -matter of principle, what sentiments are engaged, why we and they feel -so strongly about it if we do, and in what ways we each differently tap -into the historical deposit. We may for various reasons think that our -opponents are, among other things, in intellectual error, but the relations of political opposition cannot simply be understood in terms of -intellectual error. Our construction of freedom as a political value must -make sense of the fact that disagreements involving that value are typically matters of political opposition, and that this carries substantial -implications about the ways in which we should regard the disagreement, -and regard our opponents themselves. -II. PRIMITIVE FREEDOM -Some of the arguments I shall consider are, inevitably, very familiar.
My -excuse for putting on parade some of the usual suspects from Political -Philosophy 101 is rather like that which Descartes offered when he excused himself for “warming up the stale cabbage” of ancient skeptical -arguments.4 He admitted that the materials were very familiar, but he -thought that it made all the difference what you wanted to do with them. -They had to serve a particular method, and he wanted to illustrate that -method. More modestly, my aim is the same: the usual suspects have to -be put to work, but on a rather different task. -Mill, in Chapter 5 of On Liberty, says, informally enough: “liberty consists in doing what one desires.” He cannot quite mean this: he must at -any rate mean the capacity to do what one desires (you are not unfree if -you simply choose not to do something you desire.) Amended in this -way, Mill agrees with Locke: “Liberty, ’tis plain, consists in a power to do -or not to do; to do or forbear doing as we will. This cannot be denied.”5 -This is an idea of liberty as ability or capacity. It has an obvious disadvantage: we already have a concept of ability or capacity, and on this -showing ‘liberty’ or ‘freedom’ turn out boringly just to be other names -for it. More importantly, it misses the point of why we want these terms -in the first place. That point is registered for the first time when we add to -this kind of account a further condition, which concerns the kind of obstacle that is stopping us from doing something we want to do. We say, -more narrowly, that we are unfree if our inability is the product, specifically, of coercion, where that is taken, at least in the central cases, to -mean—using the term ‘coercion’ in a broad sense—the intentional obstructive activities of other people.
This is incorporated in Isaiah Berlin’s -famous account of “negative” liberty, and of course, as he noted, it goes -back a long way.6 Berlin quotes, for one, Helvétius: “The free man is the -man who is not in irons, nor imprisoned in a gaol, nor terrorized like a -slave by the fear of punishment … it is not lack of freedom, not to fly like -an eagle or swim like a whale.” Though I shall be concerned with what -Berlin called “negative freedom,” I shall not use that term nor discuss the -distinction between “negative” and “positive” freedom itself. (It is misleading in several respects, particularly if it is identified, as it is sometimes by Berlin, with a distinction between “freedom from” and “freedom to”.)7 The simple idea of being unobstructed in doing what you want -by some form of humanly imposed coercion, I shall call “primitive freedom.” -The range of obstacles, those identified with “coercion,” can itself be -interpreted more or less broadly. Some candidates, ordered roughly from -the obvious and agreed to the more disputable, are: -(A) Prevention by force (Helvétius’s irons and gaol); -(B) Threats of force, penalties, social rejection, and so forth -(Helvétius’s fear of punishment);8 -(C) Competition in (something like) a zero-sum game, where one -competitor sets out to stop another reaching his goal; -(D) By-products of another enterprise, not aimed at the agent; -(E) By-products of an arrangement which structurally disadvantages -(those in the position of) the agent. -Some of these variations will concern us later. There is an obvious division in the list, between cases in which an agent’s activities are deliberately directed against another agent’s capacity to do something, and those -in which they merely bring about that the agent loses that capacity.
There -is a further extension beyond this, where what is in question is someone’s -omission or failure to remove an obstacle to the other agent’s capacity. -However, this requires more background, in particular the political framework at which we shall eventually arrive, to make it reasonable to say -that the person in question has “failed” or “omitted” to do something -about this obstacle—that is to say, that this person should do something -about it. The more it can be said that there is a person or agency in this -position, the wider the range of complaints in freedom may be. However, we should not conclude from this that we should drop the reference to coercive or limiting action altogether and revert to the conception of freedom as simply power or capacity.9 We shall come back shortly -to the basic question of why this restriction to obstacles that are intended -by other agents, or created by them, or at the very least not removed by -them, should be so significant. -III. FREEDOM AS A RATIO CONCEPT -First, however, there is a different point to be made about primitive freedom. Primitive freedom is a ratio concept: it is a matter of the ratio between what people desire to do and what they are prevented by others -from doing. This implies that there are two ways to increase people’s freedom. I may remove the forces or obstacles that prevent them from satisfying their desires. But equally I may bring it about that they do not have -desires that cannot be satisfied. This leads to a paradox. Suppose, implausibly and for the sake of argument, that there were a body of entirely -contented slaves. They are not physically abused, and they do not want -to do any of the things their slavery prevents them from doing. Under -this concept of freedom, they are free.
If reformers appear and tell them -what they are missing and make them for the first time discontented, it -might even be said that it is the reformers who have taken away their -freedom. A concept of freedom that leads to this cannot be adequate. -One reaction to this is to say that freedom should be measured not in -terms of what people actually desire, but in terms of what they should -reasonably, properly, or appropriately desire. This idea can take various -forms. It can also be applied not just to a deficit of appropriate desire, as -in the case of the slaves, but to an excess of inappropriate desire, as indeed it has been by moralists in the Stoic tradition. The construction of -freedom as a political value should certainly leave room for arguments -of this form: besides the familiar answer to a complaint in freedom, that -the constraint on desire is necessary (for instance in the interests of others), there is a possible answer in some cases that the desire is unreasonable and the agent would be better off without it. In particular, he would -be more free. But as a general principle of argument, this runs the risk of -heading in the direction of what Berlin called “positive freedom”: at the -limit, the argument will be heard that coercive force can be justified to -prevent the formation of inappropriate desires or to encourage the formation of appropriate ones, so that people, as Rousseau put it, can be -forced to be free.
That notorious phrase has rightly been seen as paradoxical.10 What is true, though, is that this kind of idea is not simply an -arbitrary appropriation of the word “freedom”—it is rooted in certain -features of the concept, although it develops them in an irresponsible -way.11 -There is another way of dealing with the ratio paradox, which appeals -not to a normatively approved list of desires, but rather to some special -explanations of why people do not have certain desires they might be -expected to have. So in the slave case, the absence of a desire for freedom may be diagnosed as itself a product of coercion: it is precisely because of the way in which they are treated, prevented from hearing of -other options and so on, that the state of their desires is as it is. The idea -of this is the same as that employed in the Critical Theory test for beliefs -which supposedly legitimate some prima facie oppressive institution: -whether the belief is the product of the coercion which it supposedly -helps to legitimate. The principle of these tests seems entirely sound and -to flow naturally from the structure of the idea of coercion; the problem -with them is of course going to lie in the prospects of making good an -interpretation in these terms in any given case. We shall come back to -the happy slaves later, and try to fix rather more definitely where the Critical Theory test fits into the construction of liberty. -IV. WHY COERCION? -Why should we pick on, specifically, primitive freedom, with its concentration on human sources of constraint, as the starting point? The answer is that primitive freedom is, as we might put it, a “proto-political” -concept. This does not merely mean that if we are interested in freedom -as a political value (as we are), this is the place to start.
It means something stronger: that this is the place to start because it involves a quite -basic human phenomenon, and that phenomenon already points in the -direction of politics. -In a frequently quoted remark, Heracleitus said “They would not have -known the name of justice, if it had not been for these things,” and it is -virtually certain that “these things” are disputes, quarrels, and conflict.12 -Justice, hence an authoritative source of justice, hence an empowered -enforcer of justice, is needed to impose solutions on what would otherwise be unbounded conflicts of interest. Similarly, the restriction of our -activities by the intentional activities of others, as contrasted with the -ubiquitous limitations we face in nature, can give rise to a quite specific -reaction, resentment; and if resentment is not to express itself in more -conflict, non-cooperation, and dissolution of social relations, an authoritative determination is needed of whose activities should have priority -(needless to say, that determination itself may well use concepts of justice.) In an appropriate context, resentment can be directed to inaction, -to a refusal to remove some obstacle if it can be claimed that it is the -other party’s business to remove it. But it cannot extend to what are recognized as blankly the obstacles of nature. Rousseau’s distinction between being confined in one’s house by a snowstorm and being locked -in it by someone else remains in place.13 -But now there is a further development peculiarly connected with freedom. As soon as the authoritative source is indeed empowered and deploys coercion to enforce its rulings, that coercion itself can give rise to -resentment. Questions arise of how that power is being used, questions -that demand legitimating accounts.
Those questions are likely to become -more pressing, the closer the situation comes to that in which the authority uniquely commands the means of some kinds of coercion (such -as (A) above, and to some extent (B))—that is to say, the closer it comes -to the ideal type of there being a state. To various degrees in different -societies, these questions will be the subject of discussion. The political, -in some of its many forms, now exists. -V. TOWARDS LIBERTY -We do not yet have freedom as a political value: a political value which -from now on, making a distinction I have not used up to now, I shall call -liberty. -Primitive freedom is not itself that political value.14 We can see this by -considering an idea which arises as soon as we have the conditions of -the political, that is to say, an authority, together with appeals to that -authority. This is the idea of a claim in liberty. The following points are -obvious: -(a) No one can intelligibly make a claim against others simply on -the ground that the activities of those others restrict his primitive freedom, or that the extension of his primitive freedom requires action by -them. At best, that is the start of a quarrel, not a claim to its solution. -(b) Similarly, no sane person can expect that his primitive freedom -merely as such should be protected. -(c) Equally, suppose that someone uses the notion of a right: no sane -person can think that he has a right against others to what is demanded -by his primitive freedom as such (i.e., to anything he happens to want.) -(d) A similar point can be made in terms of the good: no one can -intelligibly think that it is good (period, as opposed to good for him) that -his primitive freedom should be unlimited.
-The effect of these points is that the resolution of questions of how far -a person’s freedom should be protected or extended, how far it is good -that it should be, how far he has a right that it should be, requires some -degree of impartiality (a general point of view, in Hume’s phrase) which -is not contained in the idea of an individual’s primitive freedom as such. -The importance of these points has been emphasized by Ronald -Dworkin.15 However, he assumes that a claim in liberty must be a claim -to a specific kind of right to do the thing in question, such as a right of -free speech. He concludes from this that there can be no conflicts between liberty, properly understood, and any rightful claim. For suppose -some other value, such as equality or more generally justice, when properly interpreted, requires that I not do a certain thing. Then I have no -right to do that thing. So I cannot correctly make a claim in liberty to do -it, and so, if I am prevented from doing it, this cannot be a restriction on -my liberty (though it is of course a restriction of primitive freedom.) -It cannot be necessary that this conclusion should follow from the -understanding of liberty. Indeed, in my view, it is necessary that it should -not follow. We are constructing liberty as a political value, which means -among other things that we can make sense of its role in political argument and political conflict, and generally of the experience of life under -a political order. It is one datum of that experience that people can even -recognize a restriction as rightful under some political value such as -equality or justice, and nevertheless regard it as a restriction on liberty. -The notion of a cost in liberty is at least as well entrenched in historical -and contemporary experience as that of a rightful claim in liberty.
-This notion of a cost in liberty can apply, I just suggested, even to -people who agree with some restrictive measure, introduced for instance -in the interests of equality—they can still regard it as a restriction on liberty, though a justified one. Dworkin’s view cannot make sense of the -attitude of such people: on his view, they are merely confused. But the -point about a cost in liberty applies even more significantly to those who -do not agree that the cost is necessary. The state enacts, by quite proper -process, some measure in the name of equality, say, which restricts the -activities of some people. Those people oppose it, and let us suppose -that they oppose it on principle: they do not accept the ideal of equality, -or this application of it, or this way of going about it. They certainly regard the measure as a restriction on their liberty. Dworkin’s view can in -its own terms give a coherent account of this reaction (they do not think -the measure is rightful), but it now raises a different question: suppose -we are supporters of the measure, what attitude should we take towards -the people who have this reaction, our political opponents? Since we think -that they are wrong in opposing the measure, specifically in denying that -the measure is justified in the name of equality, we must suppose, on -Dworkin’s view, that they are wrong in thinking that their liberty is being -restricted. They are coerced by the state, they resent it, they vividly think -that their liberty is curtailed. Dworkin patiently explains to them that -they are simply wrong in thinking this; they may think that there is a cost -in liberty, their liberty, but there is not. This is exactly the attitude that -Rousseau thought appropriate, and it seems to me just as objectionable -now as it was with him. -We should take seriously the idea that if, under certain conditions, -people think that there is a cost in liberty, then there is.
Taking that idea -seriously, I suggest, is a condition not only of taking seriously the idea of -political opposition, but of taking our political opponents themselves -seriously. -There is one class of complainants about costs in liberty whom, I think, -we need not take seriously: those who complain that their liberty, or indeed their primitive freedom, is curtailed by the mere existence of a state. -Certainly not their liberty: since liberty is freedom as a political value, no -complaint is a complaint in liberty if it would apply to any political system or any state whatsoever, so the existence of the state is not itself an -offense against or limitation on liberty (though some particular forms of -the state may of course readily be so.) Moreover, this is not simply a verbal point about the understanding of “liberty”; we need not agree, either, -that the fact that a person is subject to a state is, in itself, a limitation on -his primitive freedom. The reason for this is that the amount of freedom -that a person would have without the state is entirely indeterminate or, -at any rate, very small. Two conclusions follow about anarchism: from -the point about liberty it follows that it is not a political position, and from -the point about primitive freedom, that it is not interesting, and I happily accept both these conclusions. -VI. BEYOND CLAIMS IN LIBERTY -The Rousseau outlook (as we might call it) fails to make sense of an entirely familiar reaction that is basic to politics and to the understanding -of political opposition. For that reason, it does not encourage a helpful— -one might say, healthy—relation to one’s opponents.
What we should take -seriously are their reactions, or at least their deeper reactions, rather than -the extent to which we are disposed to share or morally approve of their -reactions, and this applies in different forms whether they are opponents -outside our polity or opponents within it. There is a potentially instructional, potentially patronizing, element in the Rousseau outlook which, -to take just the case of local opponents, is hostile to the relations of fellow citizenship which we must hope can co-exist with political opposition—so long at least that we believe that there should be one polity and -political opposition has not irreparably divided it. Indeed, this moralized outlook in some of its more spectacular historical expressions, such -as the Terror, has shown that it can destroy not just citizenship but citizens. -The philosophical fault at the heart of this outlook might be said to be -this, that it bases the idea of liberty on that of a rightful claim in liberty. -The notion of a claim in liberty, I have said, is useful in distinguishing -liberty from primitive freedom in the first place. It can do this because -any adequate idea of liberty must at any rate accommodate the idea of a -claim in liberty, and the idea of primitive freedom, in itself, cannot do so -at all. But the idea of a rightful claim in liberty implies a juridical conception, of an agreed authority which can rightfully grant or refuse such a -claim, and political opponents do not necessarily understand their situation in these terms. As I put it earlier, they are not all interpreting the -same text. -In the case of opponents in different political systems, they may not -agree on terms in which such an authority, if they imagined it to exist, -might legitimate its decisions to them.
Between opponents who share a -polity and neither of whom wants to destroy it, they will agree on an authority or process which decides what will happen, but this is not at all -equivalent to the authority’s deciding that one or another claim in liberty is rightful. The reason for this lies in a characteristic of the political -that I mentioned before, that political disagreements are not identified -through the kinds of reasons that are deployed in them. The reasons for -which an agreed political authority decides what will happen are various, and the decision in various ways may affect people’s liberty, but the -decision is not itself an announcement of what is a rightful claim in liberty. -In the very special case of a polity that has an institution of judicial -review, executive and legislative decisions can be checked against claims -in liberty. In such a state, some political decisions, in the widest sense, -are judicial ones: i.e., the decision which decides what will happen is -made for judicial reasons. (This is not the same as the familiar charge, in -criticism of such an institution or of its operation, that some of these -judicial decisions are, in a narrower sense, political ones.) But even here -the sense that one’s liberty is restricted by a decision cannot be identified with the thought that the court, if it acted rightly, would grant or -would have granted or indeed should have granted one’s claim in liberty. -One may agree that the court, if it was doing its job properly, would not -have granted such a claim, but one can still feel that the decision restricts or even violates one’s liberty. First, the court itself may accept that -its decision, though rightful, involves a cost in liberty.16 A more general -reason, however, is that judicial reasons, the kinds of reason that a constitutional court, however inventive, must attend to, are only one kind of -reason.
(Even those such as Dworkin who think that judicial review should -include explicit and wide-ranging moral reasons accept that since these -are decisions within a given legal system, they are bound by other constraints, such as stare decisis.) So the person who feels his liberty injured -may feel this in virtue of other reasons, indeed other reasons of principle, which he does not suppose would vindicate a claim of right in the -judicial forum. If he is angry at the outcome, then the focus of his anger -might be this, that things are such that the final court of appeal must -rightfully decide against him, and this thought might survive the understanding that given the legal history and the court’s situation there was -no realistic alternative to things being this way. -The thought that an action, say a political decision, involves a cost in -one’s liberty does not necessarily involve the thought that one would have -a rightful claim in liberty before some specified or indeed unspecified -authority. So what does go into the idea of a cost in liberty? We should -recall that we are trying to construct this idea as part of constructing an -idea of liberty itself which will serve our needs. The construction started -from certain experiences associated with perceived limitations on primitive freedom. We should turn back to that again, and approach the construction of the idea of a cost in liberty by considering what it is to feel -that something involves a cost in one’s liberty. -VII. RESENTMENT AND OTHER SUCH REACTIONS -When I considered in the first place the transition from primitive freedom to liberty, I said that the reaction to coercion in the most elementary case was resentment. But the experience of feeling that one’s liberty -is being restricted need not necessarily take the form of resentment.
How -far it can be expected to do so is not an easy question to pursue, because -resentment so readily merges into other negative feelings, such as anger -and dislike, not just for conceptual but also for various familiar psychological reasons. In relation to freedom, the primitive and purest case of -resentment is perhaps that in which another person acts manifestly and -effectively in a way that prevents me from doing what I want, and does -so with that intention, and I think, moreover, that there is nothing to be -said at all in favor of his doing so from any point of view except his. There -are of course many cases of resentment in which this strongest condition is not satisfied. I may think, for instance, that the action was in my -long-term interests, even that it was done with that intention, and still I -may resent it. (Of course there may be a problem in such a case of sorting out what exactly it is that I resent—I may just resent, for instance, the -fact that he took for granted his own ideas about my interests.) -It is usually said that the particular reaction of resentment is tied to -the idea of the other person’s action being not rightful. If we accept this -idea, and also identify as (necessarily) resentment the feelings that go -with a sense of a restriction on one’s liberty, we shall be back on the road -to Rousseau’s outlook. But I think that we should loosen both these connections. Resentment is not so closely tied to the idea of right,17 and a -sense of coercion or restricted liberty can be connected to reactions that -range more widely than resentment in the strictest sense. A helpful consideration here is the extent to which the person whose liberty is in question is identified with the actions that might be felt to restrict or violate -that liberty.
This idea helps us to explain the case of the citizen who thinks -that a certain political decision is both procedurally correct and right in -principle, but nevertheless experiences its consequences for himself as -a cost in liberty. The reason that this is possible is that his sense of himself is not entirely that of a person identified with the state’s decisions, -however rightful. Rousseau of course wanted each person in a virtuous -republic to be identified totally with himself or herself as citizen, but it is -inevitable and appropriate and an entirely good thing that on any conception of a modern society—and I suspect also, on a realistic conception of any society whatsoever—this is not going to be so.18 -Someone who disapproves of a measure in principle but not on procedural grounds is less identified with it than someone who approves of -it in both these respects. Someone who finds it both procedurally and in -principle objectionable is even less identified with it, and one who thinks -that all the procedures are a sham is less identified still. At the end of this -line, when the action that constrains someone is experienced as nothing -but coercion, sheer force in the interests of others, the lack of identification is total, and this certainly is resentment. But right from the beginning of this progression there is room for the idea that the action, whatever there is to be said for it, is a limitation of someone’s liberty, to the -extent that he identifies with the desires and projects which this action -will frustrate. -It is not a necessary condition of there being a cost in someone’s liberty or a restriction of it that he has such experiences of resentment, frustration, or whatever. This takes us back to a point we noticed earlier in -this construction, in the example of the happy slaves. We deplore their -by an action projects on it the idea that it is not rightful.
But then the idea of right must be -salient in those particular cases, precisely because the reaction is identified as a moralistic -rationalization. We can recognize resentment in less moralized circumstances: for instance, -where A bears a grudge against B because B beat him (fairly) in a contest. -From Freedom to Liberty: -19 -The Construction of a -Political Value -lack of liberty; they—we are fancifully supposing—do not. But if they do -not, is there anything, on the present line, on which we can build our -complaint? I suggested earlier that there is, in what I called the Critical -Theory principle. The slaves are subject to a regime which (simply as a -matter of fact) would pursue much the same objectives whatever they -desired. We are supposing that they do not experience any frustration, -although they are not allowed to satisfy some desires that human beings -in general might be expected to have (e.g., they cannot marry or travel or -stop work.) In actual fact, of course, it is very unlikely that they will not -feel frustrated in these respects, which is what makes this a rather objec- -tionable fantasy, but suppose it to be so. In addition, they do not have -certain other desires or aspirations which others have in those historical -circumstances, such as a desire for political representation. In both re- -spects, the state of their desires is identifiably a product of that regime, a -regime, moreover, which would not be responsive even if they had the -desires in question. In those circumstances, the absence of the desires -does not refute the complaint in liberty, once it is made; if anything, it -gives it extra force. It is the Critical Theory principle that explains, I think, -why a complaint in liberty is not turned away in such a situation, and -hence why the presence of frustrated desire is not a necessary condition -of a cost in liberty.19 -VIII. LIBERTY NOW -Let us try to assemble some conditions on liberty. 
We may recall -(i) A practice is not a violation of liberty if it is necessarily involved -in there being a state at all. -However, -(ii) The principle of (i) cannot be relativized to a particular state or -polity, since particular states or polities can obviously be criticized for -violations or undue restrictions of liberty. At the same time, there is -Philosophy & Public Affairs -20 -limited interest in comparing all existing states to some ideal model -of a state. In particular, what desires or frustrations people might have -under increasingly counterfactual conditions is increasingly indeter- -minate. Utopian political discourse is of course possible and may have -its uses, but it is at best obliquely related to arguments about the lib- -erty we can hope to find in our world. This is not to say that Utopian -discourses about liberty are analytically or definitionally incoherent. -In terms of the broadest construction of liberty, we can find a place -for some of them, if they are not otherwise too incoherent. But they, -and the comparisons they invite with the actual, do not do much for -the more specific construction of liberty as a value for us. -In pursuing that construction, it seems to me that we should restrict the -Utopia factor by accepting in particular that -(iii) Modernity is a basic category of social and hence of political un- -derstanding, and so a politically useful construction of liberty for us -should take the most general conditions of modernity as given. This -was the lesson of Benjamin Constant’s marvelous speech, given in 1819, -The Liberty of the Ancients compared with that of the Moderns,20 in -which he pointed out that whatever the merits for an ancient republic -of a concept of liberty linked to republican virtue, they were essen- -tially limited to the conditions of an ancient republic, and only disas- -ter could follow, as indeed it had followed in France, from trying to -apply such an ideal to a modern commercial society. 
-Of course there is room for much argument about what the condi- -tions of modernity are, what forms a modern society can intelligibly take, -and so on: but that is as it should be, for that is the substance of much -significant political argument. But granted in a general sense the condi- -tions of modernity as shaping the construction of our idea of liberty, there -will be a variety of consequences. For instance, I mentioned earlier a -range of things that can count as coercive restrictions on an agent’s do- -ing what he wants, intentional activities of others that can count as lim- -iting freedom. In the context of modernity, it will be clear why in general -factor (C) above, the effects of competition in something like a zero-sum -From Freedom to Liberty: -21 -The Construction of a -Political Value -game, will not count, because competition is integral to the social sys- -tem. -This is not to deny that there can be political arguments to the effect -that certain kinds of competition are so damaging to the general inter- -est, and perhaps to the interests of losers, that they need to be controlled: -it is merely that these are not per se arguments based on the losers’ lib- -erty. Rather similarly, factor (D) above—by-products of another enter- -prise not aimed at the person in question—do not presumptively count -as limiting that person’s liberty, though there are many special cases in -which they do so. This is because they are a ubiquitous phenomenon -essentially connected with the society’s central activities. Factor (E), on -the other hand, arrangements which structurally limit the opportunities -of some class of citizens, are more likely to count, and complaints about -power structures which have such effects are readily understood as com- -plaints in liberty. This is because we have a better and typically modern -understanding of such power structures, and, we hope, some achievable -means of changing the situation. 
-Granted that a person’s complaint that he has sustained a cost of lib- -erty lies within such limits implicit in the conditions of modernity, how- -ever exactly we understand them; granted the wider condition (i), that -the restriction is not one that would be necessary under any state; and -granted of course that it is factually correct, that is to say that his desires -really are frustrated or limited by the activities about which he is com- -plaining; then we should accept the idea that emerged from the earlier -arguments, that if someone feels that some action or arrangement im- -poses on him a cost in liberty, then it does indeed do so. This does not -mean, of course, that the action or arrangement should not be allowed: -the cost in his liberty is very often outweighed by the values served by -the action or arrangement. Moreover, it need not justify or call for any -compensation. He need not have a claim in liberty in any court. But a -cost in liberty is still what it is, even if he quite properly has to carry the -cost himself. -A construction of liberty on these lines might be thought to spread -the idea of a cost in liberty too wide. It means that, within certain limits, -anyone with a grievance or who is frustrated by others’ actions can ap- -propriately complain about restrictions on his liberty. If “appropriately” -means that it is semantically, conceptually, indeed psychologically, in- -Philosophy & Public Affairs -22 -telligible that he should do so, that is right. If it means that it is necessar- -ily useful, helpful, to be taken seriously as a contribution to political de- -bate, and not a waste of everyone’s time, it is not right. The point is that -these latter considerations are in the broadest sense political consider- -ations, and that is the point of the construction. 
-The conditions I have suggested for complaints of the loss of liberty -might be expressed in terms of “realism.” A form of liberty that could not -be offered by any state is an entirely unrealistic basis of objection, and -the limitation to the conditions of modernity implies a further step to- -wards a realistic political position or claim, which can be taken seriously. -It may be said that there are two different questions here, which this ap- -proach runs together: whether it is true that someone has sustained a -cost in liberty and whether it is sensible, useful, reasonable, or sane to -complain about it. These ideas are indeed not the same. It is not a reason -for supposing that there has been no loss of liberty, that it is not politi- -cally prudent to say that there has been: the loss of liberty lies in the -good sense attached to the resentment, not in the good sense or other- -wise of expressing it. However, what it is reasonable to count as some- -thing that it is sensible for someone to resent is a matter of one’s overall -view of the political world, and so, while the two ideas are certainly dis- -tinct, there is an extensive area in which they overlap, and a properly -political conception of liberty acknowledges this. Resentment about the -loss of liberty, like resentment about anything else, implies the thought -of an alternative world in which that loss does not occur, and just be- -cause liberty is a political value, the distance of that possible world from -the actual world must be measured in terms of political considerations -of relevance and practical intelligibility. The world of the anarchists is -too far away—too far away from anything—to ground complaints in lib- -erty at all. Many complaints that fly in the face of modernity equally do -not even cross the threshold of offering a serious political consideration. 
-It is also true, of course, that even if “Utopian politics” is a contradic- -tion in terms, “Utopian political thought” is not, and someone may make -a case for taking seriously complaints in liberty that would not get a hear- -ing in everyday political activity. He may show that some dimension of -resentment is more sensible than conventional opinion supposes; or he -may, just as effectively or more so, claim that whether it is what people -call “sensible” is not the point. The aim, he may rightly say, is to change -the world, and his elevation of his or others’ resentment into a complaint -From Freedom to Liberty: -23 -The Construction of a -Political Value -about liberty may indeed succeed in making it into a complaint about -liberty. -What we should be arguing about with such a complainant, if it is worth -arguing with him at all, is whether it is in the least sensible for him to -expect that a desire of that kind should not be frustrated; whether his -conception of a social world in which it would not be frustrated is not a -fantasy, either in general or in relation to historical circumstances in -which he necessarily finds himself; whether, on reflection, he does not -identify more deeply with the considerations that justify the coercion -than with his original desire. These are the materials of political persua- -sion, in the broadest sense, and this is what we should be engaged in. A -major aim of constructing liberty in the way I have suggested is that it -should leave space in which these arguments can take place. -There is a further and benevolent consequence. He may indeed per- -suade us our sense of what is “realistic” will change, and with it, the di- -mensions of liberty. But if, on the other hand, our persuasions succeed, -he will cease to feel the frustration. His resentment will go away. 
He may -come to identify fully with the grounds of coercion in such a case; he -may cease to desire what he originally desired; in any case he will not -care any more that he cannot have what he desires. If this happens, then, -on the construction I am offering, there will be no frustrated desire (and -not for reasons that fail the Critical Theory test); so his liberty will no -longer be restricted, and there will no longer be a cost in liberty. -IX. THE VALUE OF LIBERTY -Someone may ask why liberty is a value at all. This might mean, why is -liberty in any of the various constructions that have been given of it in -different historical circumstances a value at all? Why should human be- -ings in general be concerned with some value of that form? I do not know -that I can answer that question, beyond suggesting a set of questions to -put in its place: What view would one have to take of one’s desires and -projects and other values if there were never even a question of its being -something to be resented and resisted if others aimed to frustrate them? -What view would one have to take of those others, in particular of a po- -litical authority, for that question never to arise? -A better question might be: why is liberty the special value it is for us? -Why does it play the particular role that it does in our political thought -Philosophy & Public Affairs -24 -and aspirations? In particular, why is it so important? That question must -be directed to liberty under the kind of construction that is appropriate -to our circumstances, and one answer to it, an “internal” answer, will lie -in inviting the questioner to think about liberty in terms of those cir- -cumstances and in relation to other political values and beliefs that be- -long to our world. 
We invite him to acknowledge who and where he is, -and ask him what alternative he has to this structure of ideas and at what -Utopian distance the alternative, and the political arrangements that -might go with it, lie from the world in which we and he all live. We can -argue about the merits of those other arrangements, and this will be, -once more, a political argument, one that works with the materials which, -in this condition, he and we can use. -This is fine, so far as it goes. Yet there is something unsatisfactory about -saying just this much. On the one hand, we are insisting that if we are to -think realistically about political values, we must do so, so to speak, from -here. At the same time, indeed in making this very statement, we seem to -acknowledge that “here” is just one place among others: that we can con- -sider the modern condition, our condition, to some extent from the out- -side and compare it with others. If we can do that, then we should be -able to say rather more than we have said about this modern construc- -tion of liberty, and its value, as compared with others. This touches on a -familiar point which I mentioned very briefly before. One of the most -prominent characteristics of modernity is its historical self-conscious- -ness, and that carries with it certain demands on how we understand -ourselves. What we have said to this questioner so far does not seem to -do enough to meet those demands. Can we do any more? -Perhaps we can. In conclusion, I shall try to sketch in the barest out- -line some more that we might say. To do so, I must go back for a last time -to primitive freedom and its being, as I put it, a “proto-political” con- -cept. I argued that primitive freedom is not itself a political value (and -perhaps not a value of any kind). 
This is because the notion of a political -value implies an impartial standpoint to determine the priority of differ- -ent agents’ desires, a standpoint which is not given simply by the idea of -each person’s desires. That standpoint must be that of an authority with -a power to enforce. Once we have such an authority, I said, the question -of freedom and coercion arises again, now in relation to the coercion -which the authority exerts. If this is not to be merely another contribu- -tion to conflict, the authority must have authority; and this means that -From Freedom to Liberty: -25 -The Construction of a -Political Value -in some terms or other, it must be acknowledged as legitimate. Let us -now say there is need for legitimate government (where this means that -it is counted or recognized as legitimate in a given society, not that we -would necessarily accept it by our standards of legitimacy). -I take it that the following is a universal truth: legitimate government -is not just coercive power. It is true, moreover, in the sense of “legiti- -macy” I am using, in which the idea is relativized to local understand- -ings: everyone everywhere where there is such a thing as government -recognizes some distinction between legitimate government and a mere -conspiracy of effective coercion, even if many people have lived and do -live under such a conspiracy or in a state which is not much more. For -there to be legitimate government, there must be a legitimation story, -which explains why state power can be used to coerce some people rather -than others and to allow people to restrict other people’s freedom in some -ways rather than others. 
Moreover, this story is supposed to legitimate -the arrangements to each citizen, that is to say, to each person from whom -the state expects allegiance; though there may be other people within -the state, slaves or captives, who are nakedly the objects of coercion and -for whom there is no such legitimation story.21 -The fact that everywhere there is a legitimation story to be told to each -citizen does not imply, of course, that in terms of the story there is some -presumption that citizens should be treated equally. Most such stories -in the past have delivered various forms of inequality and hierarchy, with -corresponding constraints on the activities of some citizens in relation -to other citizens and to the state itself. The fact that there is a legitima- -tion story to be told is indeed enough to distinguish these societies as -examples of legitimate government, in contrast to mere successful ex- -amples of banditry. The significant point for us, however, and for our -construction of liberty and the value we attach to it, is that we do not -believe these stories, and it is a notable feature of modernity that we do -not. I do not mean merely that we do not accept the stories as legitimat- -ing stories for us. I mean that to a considerable degree we regard the -Philosophy & Public Affairs -26 -content of these stories, in particular those that involve religious or other -transcendental justifications, as simply untrue. It follows—or would fol- -low with much further argument—that in telling our own legitimation -story we start, in a sense, with less. In interpreting and distributing lib- -erty we allow each citizen a stronger presumption in favor of what he or -she certainly wants, to carry out his or her own desires. -Of course the presumptions in favor of equal and extensive liberty in -modern societies are intimately connected with the central activities of -those societies, in particular their forms of economic organization. 
This -is an historical platitude, but by itself it will not help our questioner who -wanted to hear more of why we value liberty as we do. Something on the -lines of the absurdly rough sketch I just outlined can perhaps give him -more. The sketch indeed connects our construction of liberty, and the -value we give it under that construction, with the condition of moder- -nity, but it offers more than the consideration (which is in itself a per- -fectly sound consideration) that this is our condition. It connects our -ideas of liberty with a universal truth, that everywhere legitimacy requires -more than mere coercion, and it adds to this the conviction that under -the conditions of modernity, whatever else may be worse, we at any rate -have a better grasp on the truth. I do not mean on the truth about lib- -erty—in relation to this questioner, that would be marching on the spot. -Rather, we have a grasp on truths that destroy those fantasies that once -provided the fabric of pre-modern legitimation stories. -If that account could be made good, it would yield the conclusion that -modern societies, or some of them, are rightly more concerned with lib- -erty and aim to deliver more of it than did earlier societies. Of course, the -liberty they aim to deliver is understood or constructed in terms appro- -priate to modernity, but that does not make their promise merely circu- -lar or empty. It is backed by the idea that whatever else they may have -taken away or made impossible, modern societies can offer and perhaps -sustain a construction of liberty in which the constraints on it are fewer -and, above all, more truthfully motivated than in most societies of the -past. - - -References - -1. John Rawls has said in Political Liberalism (New York: Columbia University Press, 1993), -p. 
xvi, “In [A] Theory [of Justice] a moral conception of justice general in scope is not distin- -guished from a strictly political theory of justice,” and the aim of the later book is to give -such a political theory. But the later account still represents the political conception as -itself a moral conception, although one directed to a special subject matter (p. 11). It is -significant how far moral conceptions still structure the theory: the solution to the central -problem of the stability of a just society, for instance, is worked out in terms of the moral -powers of its citizens. -2. The somewhat Manichean distinction between “principle” and “policy,” where the -latter is understood in consequentialist terms, is sometimes understood as roughly paral- -lel to that in the United States between the Supreme Court and the Congress. To the ex- -tremely limited extent that this is true, it can be regarded as a special product of history as -well as something of a misfortune. -3. Carl Schmitt, Das Begriff des Politischen translated as The Concept of the Political -(Chicago: University of Chicago Press, 1996). -4. Reply to the Second Set of Objections to the Meditations: The Philosophical Writings -of Descartes, vol. 2, translated by John Cottingham (Cambridge: Cambridge University Press, -1984), p. 94. -5. John Locke, Essay on Human Understanding, ii.1.56. -6. Isaiah Berlin, “Two Concepts of Liberty” (1958), reprinted in Four Essays on Liberty -(Oxford: Oxford University Press, 1969). -7. On the distinction between negative and positive freedom, see Gerald C. MacCallum, -Jr., “Negative and Positive Freedom,” Philosophical Review 76 (1967); John Rawls, A Theory -of Justice (Oxford: Clarendon Press, 1972), sec. 32. -8. Hobbes famously argued that such things do not reduce freedom, but merely raise -the cost of a particular course of action. 
Although it suited Hobbes’s purpose to treat this as -a consideration relevant to the theory of political freedom, it is better understood in the -context of an account of voluntary action: the fact that an action is coerced in this sense -does not mean, standardly, that it fails to be a fully intentional action. -9. As is argued by Raymond Geuss in History and Illusion in Politics (Cambridge: Cam- -bridge University Press, 2001), pp. 96–98. -10. Quentin Skinner (“The Paradoxes of Political Liberty,” in S. M. McMurrin, ed., Tan- -ner Lectures on Human Values VII [Salt Lake City: University of Utah Press, 1986]) points -out that this is not a paradox in the context of positive liberty theory. Indeed. But since it -is a paradox, that is a problem for the theory. -11. More irresponsibly than the tradition of republican liberty, which, as Skinner has -shown (“The Paradoxes of Political Liberty”), is something different. It is not surprising, -however, that it should be suspect for some of the same reasons: see note 18. -12. Fragment B23, in Herman Diels and Walther Kranz, Die Fragmente der Vorsokratiker, -6th ed. (Berlin: Weidmann, 1951–52) -13. Geuss (History and Illusion in Politics) refers to this remark, p. 104, 108–9, but he -does not discuss it in relation to the argument mentioned above at note 9. -14. The following arguments suggest that it is not a value of any kind, but I shall not take -up that question here. -15. Ronald Dworkin, Sovereign Virtue (Cambridge, Mass: Harvard University Press, 2000), -ch. 3. It is fair to say that Dworkin’s disinclination to accept conflicts between liberty and -equality depends as much on his account of equality as on his account of liberty. I am -grateful to Dworkin for many discussions of this subject, which have done much to shape -the present discussion. -16. The U.S. 
Supreme Court itself implicitly accepts this when it engages in “balancing.” -An illustration is the “undue burden” test for the constitutionality of regulations on abor- -tion: Planned Parenthood v. Casey, 505 US 833 (1992). (I am indebted here and elsewhere to -Robert Post.) -17. The idea that resentment is grounded in thoughts about right is encouraged by the -familiar phenomenon of back-formation, in which someone who is merely disadvantaged -18. Here Rousseau’s outlook coincides with the tradition of republican virtue (see note 11 -above). The idea that in a virtuous ancient republic the constraint to engage in public ser- -vice did not involve a cost in liberty, if it implies anything about citizens’ actual reactions, -should surely be treated with some skepticism. If it says, rather, that because an ideally -rational citizen would not react in that way, those reactions do not count, republican lib- -erty will certainly court many of the same dangers as “positive liberty.” -19. It is not suggested that this is a sufficient account of a Critical Theory test. Obviously, -beliefs and states of desire can be quite properly the causal product of regimes to which -people have been exposed or even subjected: educational regimes, for instance. Further -questions are involved: partly, about the kinds of belief in question, and what they, or the -presence or absence of certain desires, are supposed to justify; partly, about the attitude -that the people would have to the beliefs or desires if they knew how they came about. I -discuss some of the problems involved in Telling and Truthfulness (Princeton: Princeton -University Press, forthcoming.) -20. See Benjamin Constant, Political Writings, ed. Biancamaria Fontana (Cambridge: -Cambridge University Press, 1988), p. 309 ff. Cf in these connections “St Just’s Illusion,” in -my Making Sense of Humanity (Cambridge: Cambridge University Press, 1995). -21. 
I have claimed in Shame and Necessity (Berkeley: University of California Press, 1993), -ch. 5, that this was the situation with slavery in the ancient world, which was typically re- -garded as necessary rather than just: the Helots in Sparta were indeed explicitly under- -stood to be enemies in captivity. The racist justifications of modern slavery were presum- -ably meant in some sense to legitimate the institution; I am less clear how far they were -meant to legitimate it to the slaves. \ No newline at end of file diff --git a/doc/CHANGELOG.txt b/doc/CHANGELOG.txt index b8643ad..df0aabc 100644 --- a/doc/CHANGELOG.txt +++ b/doc/CHANGELOG.txt @@ -1,3 +1,6 @@ +100901 (done by Thang) +- Incorporate BiblioScrip (http://github.com/mromanello/BiblioScript) and BibUtils (http://www.scripps.edu/~cdputnam/software/bibutils/) + 100401e (done by Min on 100725) - Minor changes to paths and to make it work again from wing.nus directory (moved from forecite, due to restructuring of WING server) diff --git a/doc/index.html b/doc/index.html index 31965b6..ffb544b 100644 --- a/doc/index.html +++ b/doc/index.html @@ -97,10 +97,7 @@
    -
  • Current version: 100401d: Added - SectLabel (logical structure parsing) software from the NUS team, - and Iconip training data from Cheong Chi Hong for ParsCit with new - ParsCit model retrained. See Current version: 100901 (Coming soon): Incorporate BiblioScript software. See CHANGELOG.txt;
    @@ -109,6 +106,10 @@
  • Other versions:
    +100401d: Added + SectLabel (logical structure parsing) software from the NUS team, + and Iconip training data from Cheong Chi Hong for ParsCit with new + ParsCit model retrained. See CHANGELOG.txt;
    090625b: added documentation for complete re-installation. Improved ParsHed with added line-level CRF model together and post-processing module by NUS team, WSDL file and client for service at NUS and minor bug fixes for ParsCit. See CHANGELOG.txt;
    090316: incorporation of ParsHed (header parsing) software from the NUS team. See CHANGELOG.txt;
    081201: bug fixes and incorporation of byte position offset from the Scienstein.org team. See CHANGELOG.txt;
    @@ -201,6 +202,16 @@

    +

    Citation export formats + ADS + BIB + EndNote + ISI + RIS + WordBib +

    + +
    @@ -230,6 +241,15 @@

    +

    Citation export formats + ADS + BIB + EndNote + ISI + RIS + WordBib +

    +
    @@ -253,6 +273,15 @@

    +

    Citation export formats + ADS + BIB + EndNote + ISI + RIS + WordBib +

    +
    @@ -487,7 +516,7 @@

    Kudos

    ParsCit owes its continued maintenance and support from its user base. Here we'd like to thank them for their help.

    -

    Many thanks to Kris Jack for pointing out problems with the ELF binaries and an appropriate fix. +

    Thanks to Matteo Romanello for the suggestion and permission to incorporate BiblioScript software. Many thanks to Kris Jack for pointing out problems with the ELF binaries and an appropriate fix. Thanks to Cheong Chi Hong for fixing problems with Preprocess.pm (v100401) and contributing the ICONIP training data and XML entity problems in reference string parsing (v100401). Thanks to Priya diff --git a/doc/parsCit.cgi b/doc/parsCit.cgi index 5c9a619..2095e80 100755 --- a/doc/parsCit.cgi +++ b/doc/parsCit.cgi @@ -29,12 +29,13 @@ if ($tmpfile =~ /^([-\@\w.]+)$/) { $tmpfile = $1; } # untaint tm $tmpfile = "/tmp/" . $tmpfile; $0 =~ /([^\/]+)$/; my $progname = $1; my $outputVersion = "1.0"; -my $installDir = "/home/wing.nus/services/parscit/tools"; +#my $installDir = "/home/wing.nus/services/parscit/tools"; +my $installDir = "/home/lmthang/public_html/parsCit"; my $libDir = "$installDir/lib/"; my $logFile = "$libDir/cgiLog.txt"; my $seed = $$; my $debug = 0; -my $loadThreshold = 0.5; +my $loadThreshold = 2; ### END user customizable section $| = 1; # flush output @@ -202,102 +203,112 @@ exit; my $cmd = ""; my $outputBuf = ""; if ($demo == 1 ) { # run demo 1 -$cmd = "nice ./citeExtract.pl "; + # Thang v100901: call BiblioScript + biblioScript($option, $q, $filename, "all"); -if ($option == 1){ -$cmd .= "-m extract_citations"; -} -elsif ($option == 2){ -$cmd .= "-m extract_header"; -} -elsif ($option ==3){ -$cmd .= "-m extract_meta"; -} -elsif ($option == 4){ -$cmd .= "-m extract_section"; -} -elsif ($option == 5){ -$cmd .= "-m extract_all"; -} + $cmd = "nice ./citeExtract.pl "; -$cmd .= " $filename"; -print "Executing $cmd.\n"; -print "Input Method: $inputMethod."; -chdir ("$installDir/bin"); -print "
    [ Show XML output ]"; -print "

    ";
    -$outputBuf = `$cmd`;
    -print CGI::escapeHTML($outputBuf);
    -print "
    "; + if ($option == 1){ + $cmd .= "-m extract_citations"; + } + elsif ($option == 2){ + $cmd .= "-m extract_header"; + } + elsif ($option ==3){ + $cmd .= "-m extract_meta"; + } + elsif ($option == 4){ + $cmd .= "-m extract_section"; + } + elsif ($option == 5){ + $cmd .= "-m extract_all"; + } + + $cmd .= " $filename"; + print "Executing $cmd.\n"; + print "Input Method: $inputMethod."; + chdir ("$installDir/bin"); + print "
    [ Show XML output ]"; + print "
    ";
    +  $outputBuf = `$cmd`;
    +  print CGI::escapeHTML($outputBuf);
    +  print "
    "; } elsif ($demo == 2) { -$cmd = "nice ./citeExtract.pl -i xml "; + # Thang v100901: call BiblioScript + biblioScript($option, $q, $filename, "xml"); -if ($option == 1){ -$cmd .= "-m extract_citations"; -} -elsif ($option == 2){ -$cmd .= "-m extract_header"; -} -elsif ($option ==3){ -$cmd .= "-m extract_meta"; -} -elsif ($option == 4){ -$cmd .= "-m extract_section"; -} -elsif ($option == 5){ -$cmd .= "-m extract_all"; -} + $cmd = "nice ./citeExtract.pl -i xml "; + + if ($option == 1){ + $cmd .= "-m extract_citations"; + } + elsif ($option == 2){ + $cmd .= "-m extract_header"; + } + elsif ($option ==3){ + $cmd .= "-m extract_meta"; + } + elsif ($option == 4){ + $cmd .= "-m extract_section"; + } + elsif ($option == 5){ + $cmd .= "-m extract_all"; + } -$cmd .= " $filename"; -print "Executing $cmd.\n"; -print "Input Method: $inputMethod."; -chdir ("$installDir/bin"); -print "
    [ Show XML output ]"; -print "
    ";
    -$outputBuf = `$cmd`;
    -print CGI::escapeHTML($outputBuf);
    -print "
    "; + + $cmd .= " $filename"; + print "Executing $cmd.\n"; + print "Input Method: $inputMethod."; + chdir ("$installDir/bin"); + print "
    [ Show XML output ]"; + print "
    ";
    +  $outputBuf = `$cmd`;
    +  print CGI::escapeHTML($outputBuf);
    +  print "
    "; } elsif ($demo == 3) { -$cmd = "./parseRefStrings.pl $filename"; -print "Executing $cmd.\n"; -print "Input Method: $inputMethod."; -chdir ("$installDir/bin"); -print "
    [ Show XML output ]"; -print "
    ";
    -$outputBuf = `$cmd`;
    -print CGI::escapeHTML($outputBuf);
    -print "
    "; + # Thang v100901: call BiblioScript + biblioScript(1, $q, $filename, "ref"); + + $cmd = "./parseRefStrings.pl $filename"; + print "Executing $cmd.\n"; + print "Input Method: $inputMethod."; + chdir ("$installDir/bin"); + print "
    [ Show XML output ]"; + print "
    ";
    +  $outputBuf = `$cmd`;
    +  print CGI::escapeHTML($outputBuf);
    +  print "
    "; } else { -print "

    Invalid demo type selected\n"; -print "[ Back to ParsCit Home Page ]\n"; -printTrailer(); -logMessage("# Demo: Incorrected selected\n"); -exit; + print "

    Invalid demo type selected\n"; + print "[ Back to ParsCit Home Page ]\n"; + printTrailer(); + logMessage("# Demo: Incorrected selected\n"); + exit; } if ($option == 5) { -print "
    [ Show SectLabel output ]"; -print "

    ";
    -print (processSections($outputBuf));
    -print "
    "; + print "
    [ Show SectLabel output ]"; + print "
    ";
    +  print (processSections($outputBuf));
    +  print "
    "; } elsif ($option == 4) { -print "
    [ Show SectLabel output ]"; -print "
    ";
    -print (processSections($outputBuf));
    -print "
    "; + print "
    [ Show SectLabel output ]"; + print "
    ";
    +  print (processSections($outputBuf));
    +  print "
    "; } if ($option == 5 || $option == 2) { -print "
    [ Show ParsHed output ]"; -print "
    "; -print (processHeader($outputBuf)); -print "
    "; + print "
    [ Show ParsHed output ]"; + print "
    "; + print (processHeader($outputBuf)); + print "
    "; } if ($option == 5 || $option == 1 || $demo == 2 || $demo == 3) { -print "
    [ Show ParsCit output ]"; -print "
    "; -print (processCitations($outputBuf, $filename)); -print "
    "; + print "
    [ Show ParsCit output ]"; + print "
    "; + print (processCitations($outputBuf, $filename)); + print "
    "; } # remove temporary files @@ -311,6 +322,56 @@ printTrailer(); ### END of main program ### +# Thang v100901: incorporate BiblioScript +sub biblioScript { + my ($option, $q, $fileName, $inputFormat) = @_; + + if($option =~ /^(1|3|5)$/) {# citations requested + # get export types (selected checkboxes) + my @exportTypes = (); + foreach my $type ("ads", "bib", "end", "isi", "ris", "wordbib"){ + #print "Check box $type$demo \"".$q->param("$type$demo")."\"
    "; + if($q->param("$type$demo") eq "on"){ + push(@exportTypes, $type); + } + } + + my $tmpDir = "/tmp/".newTmpFile(); + my $size = scalar(@exportTypes); + if($size > 0){ + chdir ("$installDir/bin"); + + # call to BiblioScript + my $format = $exportTypes[0]; + $cmd = "./BiblioScript/biblio_script.sh -q -i $inputFormat -o $format $fileName $tmpDir"; + system($cmd); + + # reuse the MODS file generated in the first call + for(my $i = 1; $i<$size; $i++){ + $format = $exportTypes[$i]; + $cmd = "./BiblioScript/biblio_script.sh -q -i mods -o $format $tmpDir/parscit_mods.xml $tmpDir"; + system($cmd); + } + + # get the output + foreach $format(@exportTypes){ + open(BIBLIO, "<:utf8", "$tmpDir/parscit.$format"); + my @lines = ; + my $outputBuf .= join("", @lines); + close(BIBLIO); + + print "[ Show $format ]"; + print "
    ";
    +        print CGI::escapeHTML($outputBuf);
    +        print "
    "; + } + print "

    "; + } + } + + system("rm -rf $tmpDir"); +} + sub loadTooHigh { my $load = `uptime`; $load =~ /load average: ([\d.]+)/i; @@ -530,3 +591,10 @@ function exit() TOOLTIP } + +# Thang v100901: method to generate tmp file name +sub newTmpFile { + my $tmpFile = `date '+%Y%m%d-%H%M%S-$$'`; + chomp($tmpFile); + return $tmpFile; +} diff --git a/doc/.htaccess b/doc/tmp.txt similarity index 100% rename from doc/.htaccess rename to doc/tmp.txt diff --git a/tmp.txt b/tmp.txt deleted file mode 100644 index 4c39fe9..0000000 --- a/tmp.txt +++ /dev/null @@ -1,278 +0,0 @@ - - - - - - -J Arrasvuori -J Holm - -Designing Interactive Music Mixing Applications for Mobile Devices -2007 -Proceedings of DIMEA - -[11, 14]. For example, technological installations have been developed for the purposes of questioning specific cultural identities [12], or to shift the role of users from ‘audience’ to collaborator [1]. There are competing perspectives on whether interacting with art will support or limit user identity as “creator.” Some note the potential for interactive art to inspire creativity and a sense of au - -[1] -Arrasvuori, J. & Holm, J. Designing Interactive Music Mixing Applications for Mobile Devices. Proceedings of DIMEA 2007, 20-27 - - - -D J Bem - -Self-perception theory. In -1972 -Advances in Experimental Social Psychology -13 -420--432 -L. Berkowitz (Ed.) - - autonomy [5]. Despite mixed predictions from the field, our expectation was that this system would enhance creative self-perceptions, and thus, creative identity through a process of self-perception [2]. THEORETICAL BACKGROUND Self-perception theory [2] posits that through self- observation people come to determine their own identities, even when behaviors are externally induced. Through self- obser - -[2] -Bem, D.J. (1972). Self-perception theory. In L. 
Berkowitz (Ed.), Advances in Experimental Social Psychology, 13, 420-432 - - - -E P Bucy -C Tao - -The mediated moderation model of interactivity -2007 -Media Psychology -9 -647--672 - -allations in which audiences have some control over the system. Although interactivity is defined as a neutral concept [8], often it is assumed that interaction will lead to more positive experiences [3, 14, 17]. Previous research on interactive systems suggests that the psychological outcomes associated with digital interaction include enhanced learning, entertainment, and persuasive effects [3, 14, 17]. Ho - -[3] -Bucy, E.P. & Tao, C. (2007). The mediated moderation model of interactivity. Media Psychology, 9, 647-672 - - - -Z Bilda -E Edmonds -D Turnbull - -Interactive Experience in Public Context -Proceedings of CC - -nteractive art in HCI is a topic of continued interest and debate [15]. Despite controversy on the topic, the relationship between technology design and art continues to prompt new systems within HCI [4, 14] An assumption of interactive art design is that interactivity engages users, and enhances user self-reflection. However, there are few empirical tests of these effects. The first goal of this study ( - -[4] -Bilda, Z., Edmonds, E., & Turnbull, D. Interactive Experience in Public Context. Proceedings of CC - - - -J Campbell - -Delusions of dialogue: Control and choice in interactive art -2000 -Leonardo -33 -133--136 - -ativity and a sense of authorship from users [6, 18]. Others have raised concerns about a false sense of creative ownership, suggesting that technical constraints may limit a user’s sense of autonomy [5]. Despite mixed predictions from the field, our expectation was that this system would enhance creative self-perceptions, and thus, creative identity through a process of self-perception [2]. 
THEORETI -e installation along a range of 1 (Not very interactive) to 7 (highly interactive) supports the notion that, at some level, all art involves a degree of interaction, such as psychological interaction [5, 6]. Second, because all people in the interactive condition had the same amount of control, these findings further reinforce the idea that perceptions of interactivity, rather than objectively defined f - -[5] -Campbell, J. (2000). Delusions of dialogue: Control and choice in interactive art. Leonardo, 33, 133-136 - - - -S Cornock -E Edmonds - -The creative process where the artist is amplified or superseded by the computer -1973 -Leonardo -1 -11--16 - -g perspectives on whether interacting with art will support or limit user identity as “creator.” Some note the potential for interactive art to inspire creativity and a sense of authorship from users [6, 18]. Others have raised concerns about a false sense of creative ownership, suggesting that technical constraints may limit a user’s sense of autonomy [5]. Despite mixed predictions from the field, our e -e installation along a range of 1 (Not very interactive) to 7 (highly interactive) supports the notion that, at some level, all art involves a degree of interaction, such as psychological interaction [5, 6]. Second, because all people in the interactive condition had the same amount of control, these findings further reinforce the idea that perceptions of interactivity, rather than objectively defined f - -[6] -Cornock, S. & Edmonds, E. (1973). The creative process where the artist is amplified or superseded by the computer. Leonardo, 1, 11-16 - - - -E Edmonds -G Turner -L Candy - -Approaches to interactive art systems -2004 -Proceedings of CGIT -113--117 - -provement April 6th, 2009 ~ Boston, MA, USA Figure 3. Many interactive groups were highly active in their use of the system. question of whether or not interactive art actually can influence identity [7, 18] deserves continued attention. 
As mentioned, future tests using this system may involve a different set of instructions to users. An additional alternative is to alter the design to increase users’ se - -[7] -Edmonds, E., Turner, G., & Candy, L. Approaches to interactive art systems. Proceedings of CGIT 2004, 113-117 - - - -T Erickson - -Five Lenses: Towards a Toolkit for Interaction Design Theories and Practice -2006 -in Interaction Design (Eds. Bagnara, G. Crampton-Smith, & G. Salvendy.) Lawrence Erlbaum -301--310 - -.00 BACKGROUND ON INTERACTIVE ART A trend in HCI is to build interactive art installations in which audiences have some control over the system. Although interactivity is defined as a neutral concept [8], often it is assumed that interaction will lead to more positive experiences [3, 14, 17]. Previous research on interactive systems suggests that the psychological outcomes associated with digital int - -[8] -Erickson, T. (2006) Five Lenses: Towards a Toolkit for Interaction Design Theories and Practice in Interaction Design (Eds. Bagnara, G. Crampton-Smith, & G. Salvendy.) Lawrence Erlbaum: p.301-310. - - - -A L Gonzales -J T Hancock - -Identity shift in computer-mediated environments -2008 -167--185 -Media Psychology - -cted to think of themselves as more creative than those individuals that do not interact with the system. Evidence of psychological “identity shift” through self-observation has been found in weblogs [9], but has not been tried in offline technological spaces. 415 CHI 2009 ~ Creative Thought and Self-Improvement April 6th, 2009 ~ Boston, MA, USA EXPERIMENTAL DESIGN Participants were exposed to an exp -much do you enjoy participating in creative activities?; range 5-35, high score=creative self-perception, a=.85). The second measure was an adjective checklist designed to assess creative personality [9]. Mean ratings of self-perceived creativity were compared across both scales using a t-test. 
In contrast to our hypothesis, participants exposed to the interactive installation were not more likely to - -[9] -Gonzales, A.L. & Hancock, J.T. (2008). Identity shift in computer-mediated environments. Media Psychology, 167-185. - - - -H G Gough - -A creative personality scale for the Adjective Check List -1979 -Journal of Personality and Social Psychology -37 -1398--1405 -[10] -Gough, H. G. (1979). A creative personality scale for the Adjective Check List. Journal of Personality and Social Psychology, 37, 1398-1405. - - - -K Hook -P Sengers -G Andersson - -Sense and sensibility: evaluations and interactive art -2003 -Proceedings of SIGCHI -241--248 - -c environment. According to “reflective design” philosophies, technology can and should prompt user self-reflection [16]. Interactive art has been assumed to shape user identities and self- awareness [11, 14]. For example, technological installations have been developed for the purposes of questioning specific cultural identities [12], or to shift the role of users from ‘audience’ to collaborator [1]. The - -[11] -Hook, K., Sengers, P., & Andersson, G. (2003). Sense and sensibility: evaluations and interactive art. Proceedings of SIGCHI 2003, 241-248 - - - -W -A D Cheok - -Magic Asian Art -2006 -Proceedings of CHI - -e art has been assumed to shape user identities and self- awareness [11, 14]. For example, technological installations have been developed for the purposes of questioning specific cultural identities [12], or to shift the role of users from ‘audience’ to collaborator [1]. There are competing perspectives on whether interacting with art will support or limit user identity as “creator.” Some note the po - -[12] -Park, E., Kim, B., Salim. W., & Cheok, A.D. Magic Asian Art. Proceedings of CHI 2006 - - - -J L Rasmussen - -Analysis of Likert-Scale Data: A Reinterpretation of Gregoire and Driver -1989 -Psych Bulletin -105 -167--170 -[13] -Rasmussen, J.L. (1989). 
Analysis of Likert-Scale Data: A Reinterpretation of Gregoire and Driver. Psych Bulletin, 105, 167-170. - - - -D Richards - -Is interactivity actually important? IE -2006 - -nteractive art in HCI is a topic of continued interest and debate [15]. Despite controversy on the topic, the relationship between technology design and art continues to prompt new systems within HCI [4, 14] An assumption of interactive art design is that interactivity engages users, and enhances user self-reflection. However, there are few empirical tests of these effects. The first goal of this study ( -allations in which audiences have some control over the system. Although interactivity is defined as a neutral concept [8], often it is assumed that interaction will lead to more positive experiences [3, 14, 17]. Previous research on interactive systems suggests that the psychological outcomes associated with digital interaction include enhanced learning, entertainment, and persuasive effects [3, 14, 17]. Ho -c environment. According to “reflective design” philosophies, technology can and should prompt user self-reflection [16]. Interactive art has been assumed to shape user identities and self- awareness [11, 14]. For example, technological installations have been developed for the purposes of questioning specific cultural identities [12], or to shift the role of users from ‘audience’ to collaborator [1]. The - -[14] -Richards, D. (2006). Is interactivity actually important? IE 2005 - - - -P Sengers -C Csikszentmihályi - -HCI and the Arts: A Conflicted Convergence -2003 -Proceedings of CHI - -vity, music installation ACM Classification Keywords H5.1 Multimedia Information Systems H5.5 Sound and Music Computing INTRODUCTION Interactive art in HCI is a topic of continued interest and debate [15]. 
Despite controversy on the topic, the relationship between technology design and art continues to prompt new systems within HCI [4, 14] An assumption of interactive art design is that interactivity - -[15] -Sengers, P., & Csikszentmihályi, C. HCI and the Arts: A Conflicted Convergence? Proceedings of CHI 2003 - - - -P Sengers -K Boehner -S David -J Kaye - -Reflective design -2005 -Proceedings of Critical Computing -49--58 - -n empirical test of the effect of interactivity on the user experience in an artistic environment. According to “reflective design” philosophies, technology can and should prompt user self-reflection [16]. Interactive art has been assumed to shape user identities and self- awareness [11, 14]. For example, technological installations have been developed for the purposes of questioning specific cultural -n both conditions were told to “reflect on themselves, the sound and the space.” Directions were intentionally vague, in order to allow participants to provide their own interpretations of the system [16]. After being exposed to the system, we gave participants multiple questionnaires to fill-out in response to the experience. RESULTS Manipulation Check of Interactivity To ensure that subjects perceiv - -[16] -Sengers, P., Boehner, K., David, S., & Kaye, J. Reflective design. Proceedings of Critical Computing 2005, 49-58 - - - -P Vorderer -S Knobloch -H Schramm - -Does entertainment suffer from interactivity? The impact of watching an interactive TV movie on viewers’ experience of entertainment -2001 -Media Psychology -3 -343--363 - -allations in which audiences have some control over the system. Although interactivity is defined as a neutral concept [8], often it is assumed that interaction will lead to more positive experiences [3, 14, 17]. Previous research on interactive systems suggests that the psychological outcomes associated with digital interaction include enhanced learning, entertainment, and persuasive effects [3, 14, 17]. 
Ho - -[17] -Vorderer, P., Knobloch, S., & Schramm, H. (2001). Does entertainment suffer from interactivity? The impact of watching an interactive TV movie on viewers’ experience of entertainment. Media Psychology, 3, 343-363. - - - -K D D Willis - -User authorship and creativity within interactivity -2006 -Proceedings of Multimedia -731--735 - -g perspectives on whether interacting with art will support or limit user identity as “creator.” Some note the potential for interactive art to inspire creativity and a sense of authorship from users [6, 18]. Others have raised concerns about a false sense of creative ownership, suggesting that technical constraints may limit a user’s sense of autonomy [5]. Despite mixed predictions from the field, our e -provement April 6th, 2009 ~ Boston, MA, USA Figure 3. Many interactive groups were highly active in their use of the system. question of whether or not interactive art actually can influence identity [7, 18] deserves continued attention. As mentioned, future tests using this system may involve a different set of instructions to users. An additional alternative is to alter the design to increase users’ se - -[18] -Willis, K.D.D. User authorship and creativity within interactivity. Proceedings of Multimedia 2006, 731-735 - - - - \ No newline at end of file diff --git a/tmpDir/parscit.bib b/tmpDir/parscit.bib new file mode 100644 index 0000000..692a15b --- /dev/null +++ b/tmpDir/parscit.bib @@ -0,0 +1,68 @@ +@Article{d1e50, +author="Deerwester, S. +and Furnas, G. +and Landauer, T. +and Harshman, R.", +title="Indexing by Latent Semantic Anaysis", +journal="Journal of the American Society of Information Science", +pages="41--6" +} + +@Article{d1e260, +journal="Journal of Computer Science and Information Management" +} + +@Article{d1e288, +author="Wendlandt, E. 
+and Driscoll, R.", +title="Incorporating a semantic analysis into a document retrieval strategy", +journal="CACM", +pages="54--48" +} + +@Book{d1e87, +author="Halliday, M. A. K.", +title="An Introduction to Functional Grammar. Edward", +year="1985", +address="Arnold, London" +} + +@Book{d1e121, +author="Jang, S.", +title="Extracting Context from Unstructured Text Documents by Content Word Density", +year="1997" +} + +@Book{d1e202, +author="Shin, H.", +title="Incorporating Semantic Categories (SEMCATs) into a Partial Information Retrieval System", +year="1997" +} + +@Book{d1e236, +author="Shin, H. +and Stach, J.", +title="Incorporating Probabilistic Semantic Categories (SEMCATs) Into Vector Space Techniques for Partial Document Retrieval", +year="1999" +} + +@InCollection{d1e7, +author="Boyd, R. +and Driscoll, J. +and Syu, I.", +title="incorporating Semantics Within a Connectionist Model and a Vector Processing Model", +booktitle="In Proceedings of the TREC-2", +year="1994", +pages="NIST." +} + +@InCollection{d1e167, +author="Moffat, A. +and Davis, R. +and Wilkinson, R. 
+and Zobel, J.", +title="Retrieval of Partial Documents", +booktitle="In Proceedings of TREC-2", +year="1994" +} + diff --git a/tmpDir/parscit_mods.xml b/tmpDir/parscit_mods.xml new file mode 100644 index 0000000..cfa0188 --- /dev/null +++ b/tmpDir/parscit_mods.xml @@ -0,0 +1,281 @@ + + + + + Indexing by Latent Semantic Anaysis + + text + + S + Deerwester + + author + + + + G + Furnas + + author + + + + T + Landauer + + author + + + + R + Harshman + + author + + + + + Journal of the American Society of Information Science + + + continuing + + + + 41 + 6 + + + journal + academic journal + + d1e50 + + + text + + + Journal of Computer Science and Information Management + + + continuing + + journal + academic journal + + d1e260 + + + + Incorporating a semantic analysis into a document retrieval strategy + + text + + E + Wendlandt + + author + + + + R + Driscoll + + author + + + + + CACM + + + continuing + + + + 54 + 48 + + + journal + academic journal + + d1e288 + + + + An Introduction to Functional Grammar. Edward + + text + + M + A + K + Halliday + + creator + + + + 1985 + + Arnold, London + + monographic + + d1e87 + + + + Extracting Context from Unstructured Text Documents by Content Word Density + + text + + S + Jang + + creator + + + + 1997 + monographic + + d1e121 + + + + Incorporating Semantic Categories (SEMCATs) into a Partial Information Retrieval System + + text + + H + Shin + + creator + + + + 1997 + monographic + + d1e202 + + + + Incorporating Probabilistic Semantic Categories (SEMCATs) Into Vector Space Techniques for Partial Document Retrieval + + text + + H + Shin + + creator + + + + J + Stach + + creator + + + + 1999 + monographic + + d1e236 + + + + incorporating Semantics Within a Connectionist Model and a Vector Processing Model + + text + + R + Boyd + + author + + + + J + Driscoll + + author + + + + I + Syu + + author + + + + + In Proceedings of the TREC-2 + + + + NIST. 
+ + + + + monographic + 1994 + + collection + + d1e7 + + + + Retrieval of Partial Documents + + text + + A + Moffat + + author + + + + R + Davis + + author + + + + R + Wilkinson + + author + + + + J + Zobel + + author + + + + + In Proceedings of TREC-2 + + + monographic + 1994 + + collection + + d1e167 + + + diff --git a/tmpDir/parscit_temp.xml b/tmpDir/parscit_temp.xml new file mode 100644 index 0000000..5f988f3 --- /dev/null +++ b/tmpDir/parscit_temp.xml @@ -0,0 +1,131 @@ + + + + + + +R Boyd +J Driscoll +I Syu + +incorporating Semantics Within a Connectionist Model and a Vector Processing Model +1994 +In Proceedings of the TREC-2 +NIST. + +ed out in Section 1, all terms are not known in partial text retrieval. Further, our approach is based on semantic weight rather than word frequency. Therefore any frequency based measures defined by Boyd et al. (1994) and Wendlandt (1991) need to be built from the probabilities of individual semantic categories. Those modifications are described below. As a simplifying assumption, we assume SEMCATs have a uniform +ar structures function as 12 minor predication and as such are loci of semantic intent or coherence. In order to facilitate the use of long runs as predictors, we modified the traditional measures of Boyd et al. (1994), Wendlandt (1991) to accommodate semantic categories and partial text retrieval. The revised metrics and the computational method we propose were used in the statistical experiments presented above. + +Boyd, Driscoll, Syu, 1994 +Boyd R., Driscoll J, and Syu I. (1994) incorporating Semantics Within a Connectionist Model and a Vector Processing Model. In Proceedings of the TREC-2, NIST. + + + +S Deerwester +G Furnas +T Landauer +R Harshman + +Indexing by Latent Semantic Anaysis +1990 +Journal of the American Society of Information Science +41--6 +Deerwester, Furnas, Landauer, Harshman, 1990 +Deerwester S., Furnas G., Landauer T., and Harshman R. (1990) Indexing by Latent Semantic Anaysis. 
Journal of the American Society of Information Science 41-6. + + + +M A K Halliday + +An Introduction to Functional Grammar. Edward +1985 +Arnold, London + +as a semantic predictor. We examined all the long runs of the Jang (1997) collection and discovered most of them originate from the prepositional phrase and subject complement positions. According to Halliday (1985), a preposition is explained as a minor verb. It functions as a minor Predicator having a nominal group as its complement. Thus the internal structure of 'across the lake' is like that of 'crossing th +hort run lengths are drawn from different populations, (2) our observation that these long runs of content words originate from the prepositional phrase and subject complement positions. According to Halliday (1985) those grammar structures function as 12 minor predication and as such are loci of semantic intent or coherence. In order to facilitate the use of long runs as predictors, we modified the traditional + +Halliday, 1985 +Halliday M.A.K. (1985) An Introduction to Functional Grammar. Edward Arnold, London. + + + +S Jang + +Extracting Context from Unstructured Text Documents by Content Word Density +1997 +M.S. Thesis +University of Missouri-Kansas City + +Runs Partial Information Retrieval has to with detection of main ideas. Main ideas are topic sentences that have central meaning to the text. Our method of detecting main idea paragraphs extends from Jang (1997) who observed that after stemming and stopping a document, long runs of content words cluster. Content word runs are a sequence of content words with a function word(s) prefix and suffix. These runs c +erify this, we designed a methodology to incorporate semantic features into information retrieval and examined long runs of content words as a semantic predictor. We examined all the long runs of the Jang (1997) collection and discovered most of them originate from the prepositional phrase and subject complement positions. 
According to Halliday (1985), a preposition is explained as a minor verb. It functions +tions, it would suggest that the speaker is saying something important and the longer runs of content words would signal a locus of the speaker's intention. Extending from the statistical analysis of Jang (1997) and our observations of those long runs in the collection, we give a basic assumption of OUT study: Long runs of content words contain significant semantic information that a speaker wants to express +ogy. 3.1 Revised Probability and Vector Processing In order to understand the calculation of SEMCATs, it is helpful to look at the structure 8 of a preprocessed document. One document &quot;Barbie&quot; in the Jang (1997) collection has a total of 1,468 words comprised of 755 content words and 713 function words. The document has 17 paragraphs. Filtering out function words using the Brown Corpus exposed the runs of co +raphs with long runs, computing and summing the semantic coherence of the longest runs only, (3) ranking the eligible paragraphs for retrieval based upon their semantic weights relative to the query. Jang (1997) established that the distribution of long runs of content words and short runs of content words in a collection of paragraphs are drawn from different populations. This implies 10 that either long ru + +Jang, 1997 +Jang S. (1997) Extracting Context from Unstructured Text Documents by Content Word Density. M.S. Thesis, University of Missouri-Kansas City. + + + +A Moffat +R Davis +R Wilkinson +J Zobel + +Retrieval of Partial Documents +1994 +In Proceedings of TREC-2 +Moffat, Davis, Wilkinson, Zobel, 1994 +Moffat A., Davis R., Wilkinson, R., and Zobel J. (1994) Retrieval of Partial Documents. In Proceedings of TREC-2. + + + +H Shin + +Incorporating Semantic Categories (SEMCATs) into a Partial Information Retrieval System +1997 +M.S. 
Thesis +University of Missouri Kansas City + +between the sum of long run SEMCAT weights and the semantic coherence of a paragraph, the total paragraph SEMCAT weight. A detailed description of these experiments and their outcome are described in Shin (1997, 1999). The results of the experiments and the implications of those results relative to the method we propose are discussed below. Table 3 gives the SEMCAT weights for seventeen paragraphs randomly + +Shin, 1997 +Shin H. (1997) Incorporating Semantic Categories (SEMCATs) into a Partial Information Retrieval System. M.S. Thesis, University of Missouri Kansas City. + + + +H Shin +J Stach + +Incorporating Probabilistic Semantic Categories (SEMCATs) Into Vector Space Techniques for Partial Document Retrieval +1999 +Shin, Stach, 1999 +Shin H., Stach J. (1999) Incorporating Probabilistic Semantic Categories (SEMCATs) Into Vector Space Techniques for Partial Document Retrieval. + + +1999 +Journal of Computer Science and Information Management +2 +to appear + +en the sum of long run SEMCAT weights and the semantic coherence of a paragraph, the total paragraph SEMCAT weight. A detailed description of these experiments and their outcome are described in Shin (1997, 1999). The results of the experiments and the implications of those results relative to the method we propose are discussed below. Table 3 gives the SEMCAT weights for seventeen paragraphs randomly chosen + +1999 +Journal of Computer Science and Information Management, vol. 2, No. 4, December 1999, to appear. + + + +E Wendlandt +R Driscoll + +Incorporating a semantic analysis into a document retrieval strategy +1991 +CACM +31 +54--48 +Wendlandt, Driscoll, 1991 +Wendlandt E. and Driscoll R. (1991) Incorporating a semantic analysis into a document retrieval strategy. CACM 31, pp. 54-48. + + + + \ No newline at end of file