GitHub - cocoxu/Shakespeare

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 180 Commits
Notes		Notes
RomeoJuliet		RomeoJuliet
bilingual-sentence-aligner		bilingual-sentence-aligner
conf		conf
data		data
dictionary		dictionary
eval		eval
mert		mert
models		models
models2		models2
other_authors		other_authors
paper		paper
plays		plays
python		python
python_wei		python_wei
python_wei_2		python_wei_2
scrapper2		scrapper2
scrapper3		scrapper3
scripts		scripts
sonnets		sonnets
word-alignment/data		word-alignment/data
PLAN_COLING		PLAN_COLING
README		README
README_DATA		README_DATA
README_LICENSE		README_LICENSE
urls.txt		urls.txt

Repository files navigation

Please refer to our paper entitled "Paraphrasing for Style" at COLING 2012 for more details.
Please also cite our paper accordingly.


@inproceedings{xu2012paraphrasing,
  title={Paraphrasing for Style},
  author={Xu, Wei and Ritter, Alan and Dolan, Bill and Grishman, Ralph and Cherry, Colin},
  booktitle={COLING},
  pages={2899--2914},
  year={2012}
}

============================================================================================


python/crawl_{plays|sonnets}.py
crawls the nfs.sparknotes.com website and downloads html

python/scrapper_soupparser.py
extracts original and modern english sentences from the HTML

scripts/extract_lines.sh
runs scrapper_soupparser.py on all the HTML files and puts output into format suitable for sentence alignment (in data/plays/align)

scrapper2/
scripts and data for scrapping parallel texts from www.enotes.com

data/align/plays/ from nfs.sparknotes.com
data/align/plays2/ from www.enotes.com 

data/align/plays/notmerged
data/align/plays2/notmerged
Contains the aligned sentences at the level of HTML pages

data/align/plays/merged
data/align/plays2/merged
All sentences merged together into plays.
The bilingual sentence aligner doesn't really seem to work with very small files (e.g. the HTML pages).  Presumably
it should be possible to do a better job with the smaller files, since they are more tightly aligned, but I don't
think this is the type of data the sentence aligner was designed for.
Anyway, it seems to work fine on the merged plays.  Out of 31,718 original lines the sentence aligner finds 21,079
alignments.  It may be possible to improve on this...
Also note: in the merged plays the pages are in random order.  I think this probably shouldn't make much difference,
but might be worth fixing...

data/align/plays/model_16plays
Moses model trained on 16 plays (except R&J), following the instructions on http://www.statmt.org/moses_steps.html
  moses.ini
  moses-bin.ini
The two above use same parameters but provide different outputs (?)
  ./tuning/ 
I put R&J over there, planning to split into dev and test.

data/test_small
  Wei_facebook.txt
A small set of Facebook posts collected by hand from my Facebook.
  03_natural_tweets.pl
  natural_tweets_10000.txt
Made an attempt to find 'meaningful' Tweets by heuristic rules, e.g. propotion of words in dictionary etc. 

data/shakespere.dict
A Shakespeare Glossary from http://www.william-shakespeare.info/
scripts and html pages used:
  dictionary/*.htm 
  python/crawl_dictionary.py	
  python/scrap_dictionary.py

mert/
Contains held out dev data from Romeo & Juliet for discriminative training.
mert/mert-work/moses.ini contains tuned weights for the 16 plays model

eval/
Files for evaluating BLEU.  Below are some results from the 16 plays model with/without MERT:

First note that the 2nd translation (enotes.com) appears to be much closer to the original text than the first (sparknotes):
-bash-4.1$ ~/mt/mosesdecoder/scripts/generic/multi-bleu.perl ascii.romeojuliet_tokenized_lower_original < ascii.romeojuliet_tokenized_lower_modern.1
BLEU = 24.67, 56.5/30.2/18.8/11.5 (BP=1.000, ratio=1.047, hyp_len=4312, ref_len=4120)
-bash-4.1$ ~/mt/mosesdecoder/scripts/generic/multi-bleu.perl ascii.romeojuliet_tokenized_lower_original < ascii.romeojuliet_tokenized_lower_modern.2
BLEU = 52.30, 75.9/57.7/46.0/37.1 (BP=1.000, ratio=1.044, hyp_len=4300, ref_len=4120)

Current BLEU scores (looks like the larger LM has a big effect):
16and7plays_16LM.1
BLEU = 27.35, 59.8/33.3/21.0/13.4 (BP=1.000, ratio=1.041, hyp_len=4287, ref_len=4120)
16and7plays_16LM_mert.1
BLEU = 27.61, 59.9/33.8/21.4/13.5 (BP=1.000, ratio=1.039, hyp_len=4281, ref_len=4120)
16and7plays_16LM.2
BLEU = 52.68, 77.4/58.6/46.2/36.7 (BP=1.000, ratio=1.041, hyp_len=4289, ref_len=4120)
16and7plays_16LM_mert.2
BLEU = 53.27, 77.5/59.3/46.9/37.4 (BP=1.000, ratio=1.045, hyp_len=4306, ref_len=4120)
16and7plays_37LM.1
BLEU = 30.56, 61.1/36.1/24.2/16.3 (BP=1.000, ratio=1.044, hyp_len=4300, ref_len=4120)
16and7plays_37LM_mert.1
BLEU = 30.54, 60.7/36.0/24.3/16.4 (BP=1.000, ratio=1.048, hyp_len=4319, ref_len=4120)
16and7plays_37LM.2
BLEU = 57.80, 79.3/63.0/52.0/42.9 (BP=1.000, ratio=1.045, hyp_len=4305, ref_len=4120)
16and7plays_37LM_mert.2
BLEU = 57.32, 78.9/62.3/51.5/42.7 (BP=1.000, ratio=1.052, hyp_len=4335, ref_len=4120)
16plays_37LM.1
BLEU = 30.63, 61.1/36.1/24.2/16.5 (BP=1.000, ratio=1.032, hyp_len=4250, ref_len=4120)
16plays_37LM_mert.1
BLEU = 30.53, 60.8/35.9/24.2/16.4 (BP=1.000, ratio=1.030, hyp_len=4244, ref_len=4120)
16plays_37LM.2
BLEU = 55.54, 78.0/61.1/49.6/40.2 (BP=1.000, ratio=1.036, hyp_len=4267, ref_len=4120)
16plays_37LM_mert.2
BLEU = 56.02, 78.0/60.9/50.1/41.4 (BP=1.000, ratio=1.040, hyp_len=4285, ref_len=4120)
singleword1.1
BLEU = 24.39, 56.4/29.8/18.6/11.3 (BP=1.000, ratio=1.047, hyp_len=4312, ref_len=4120)
singleword1.2
BLEU = 51.51, 75.8/56.9/45.1/36.2 (BP=1.000, ratio=1.044, hyp_len=4300, ref_len=4120)


./models/Lexicons/singleword1/
A very simple (not sure about correctness) dictionary-based model, which can translate "No way excuse his disadvantages when we do bear" into "No way excuse his foils when we do bear". The phrase table P(o|m) based on a verbatim dictionary (1520 pairs) and the frequencies of the 'original' words in the 37 Shakespeare plays. E.g. 
   - There are two entries of Shakespearean word 'abrook' in the dictionary 'abrook -> abide' and 'abrook' -> 'brook'. Extend it to 'modern -> original' phrase table, by considering the reverse direction (e.g. 'abide -> abrook' etc ), the identical word pair (e.g. 'abide -> abide' etc), and adding suffixes (e.g. 'abides -> abrooks', 'abideed -> abrooked', etc)
   - Look into the language model built on 37 original plays (37plays_tokenized_lowercased.original.1gram) and derive P(o|m) based on unigram conditional probability. E.g. 
    abide ||| abide ||| 0.940088299386192
    abide ||| abrook ||| 0.0599117006138077
    brook ||| brook ||| 0.95608871056976
    brook ||| abrook ||| 0.0439112894302401