The idea of this project was to take the word-vector-space projection of a novel and perform a wavelet transform on it. Wavelet transforms of images can produce an effect of ghostly, echoing expansion, and I thought it would be worth seeing what that might look like on text.
- groovy
  - `encode-transform.groovy` will grab `de.sciss:jwave`
- python 3
- annoy
- full text of The Waves by Virginia Woolf, currently available at https://ebooks.adelaide.edu.au/w/woolf/virginia/w91w/
- word vectors generated from Project Gutenberg novels by @aparrish, currently available at https://s3.amazonaws.com/aparrish/novel-vectors-word2vec.gz
- Preprocess
  - Drop the text in the same directory as the scripts as `waves.txt`.
  - The text of The Waves at the above link has special characters (smart quotes and em dashes) which are not represented in `novel-vectors-word2vec`. I did a lazy search-and-replace to convert those to `'`, `''`, and `--`.
- Run
  - `groovy tokenize.groovy`
  - `groovy encode-transform.groovy`
  - `python3 vecs2text.py`
- Output will be in `the-wavelets.txt`.
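The search-and-replace from the preprocessing step could also be scripted. This is a minimal sketch, not what I actually ran: the function name `normalize` is made up for illustration, and the exact mapping (both single smart quotes to `'`, both double smart quotes to `''`, em dash to `--`) is an assumption about which character goes to which replacement.

```python
# Assumed mapping from special characters to ASCII stand-ins.
# Which curly quote maps to which replacement is a guess, not taken from the scripts.
REPLACEMENTS = {
    "\u2018": "'",   # left single smart quote
    "\u2019": "'",   # right single smart quote / apostrophe
    "\u201c": "''",  # left double smart quote
    "\u201d": "''",  # right double smart quote
    "\u2014": "--",  # em dash
}


def normalize(text):
    """Replace smart quotes and em dashes with their ASCII stand-ins."""
    return text.translate(str.maketrans(REPLACEMENTS))


sample = "\u201cIt\u2019s late,\u201d she said\u2014quietly."
cleaned = normalize(sample)
print(cleaned)  # ''It's late,'' she said--quietly.

# To rewrite waves.txt in place (not executed here):
# from pathlib import Path
# path = Path("waves.txt")
# path.write_text(normalize(path.read_text(encoding="utf-8")), encoding="utf-8")
```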
- The Wavelets
- Bonus: Haar of Darkness (Haar wavelet transform of Conrad's Heart of Darkness)
The results are slightly entertaining, but iffy as a legible text. It's reassuring to see that the average value for both texts (the first word in the transformed text) roughly aligns with the mood of the text. Subsequent words represent the oscillations between sections of text, which I think is harder to intuit than pixel values representing oscillations in sections of an image.
Specifically, the second-through-fourth words of The Wavelets represent that the second half of The Waves is more `Grandmother` than the first, the second quarter (and to a lesser extent last quarter) more `Mas'r` than the first, and the last quarter (and to a lesser extent second quarter) more `Meadows` than the third.
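To make the "average first, then oscillations" ordering concrete, here is a small Haar sketch on plain numbers. It is an illustration under stated assumptions, not the code in `encode-transform.groovy`: the function name `haar_forward` is hypothetical, the detail coefficients use an unnormalized (second minus first) / 2 convention, and JWave's own scaling and sign conventions may differ. For word vectors the same arithmetic would apply to each dimension separately.

```python
def haar_forward(values):
    """Full Haar decomposition of a list whose length is a power of two.

    Returns [overall average, coarsest detail, ..., finest details].
    Convention (an assumption): detail = (second - first) / 2.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(values)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level_detail = [(b - a) / 2 for a, b in pairs]
        approx = [(a + b) / 2 for a, b in pairs]
        # Coarser levels go in front of the finer ones already collected.
        details = level_detail + details
    return approx + details


# Four "quarters" with values 1, 3, 2 and 8 (two samples each).
signal = [1.0, 1.0, 3.0, 3.0, 2.0, 2.0, 8.0, 8.0]
coeffs = haar_forward(signal)
print(coeffs)  # [3.5, 1.5, 1.0, 3.0, 0.0, 0.0, 0.0, 0.0]
```

In this toy case the first coefficient (3.5) is the overall average, the second (1.5) is half the gap between the second-half mean (5) and the first-half mean (2), and the third and fourth (1.0 and 3.0) compare the second quarter with the first and the last quarter with the third.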
- Accept command line arguments for file names
- Smarter tokenization
- Low-pass filtered output for "summarization"
- Merging two works by averaging in the frequency domain and wavelet reconstruction
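The low-pass item above could be prototyped along these lines. This is only a sketch of one possible reading of that TODO: the names `haar_inverse` and `low_pass` are hypothetical, the coefficient layout is assumed to be [overall average, coarsest detail, ..., finest details], and each detail is assumed to be (second - first) / 2, which may not match JWave's conventions.

```python
def haar_inverse(coeffs):
    """Rebuild a signal from [average, coarsest detail, ..., finest details].

    Assumes detail = (second - first) / 2, so a pair is (a - d, a + d).
    """
    approx = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        expanded = []
        for a, d in zip(approx, detail):
            expanded.extend([a - d, a + d])
        approx = expanded
    return approx


def low_pass(coeffs, keep):
    """Keep the first `keep` coefficients and zero out the finer ones."""
    return coeffs[:keep] + [0.0] * (len(coeffs) - keep)


# Example coefficients for an 8-sample signal (made-up numbers).
coeffs = [4.75, 0.75, 2.0, 3.5, 1.0, 1.0, 0.0, 1.0]
original = haar_inverse(coeffs)
smoothed = haar_inverse(low_pass(coeffs, 2))
print(original)  # [1.0, 3.0, 5.0, 7.0, 2.0, 2.0, 8.0, 10.0]
print(smoothed)  # [4.0, 4.0, 4.0, 4.0, 5.5, 5.5, 5.5, 5.5]
```

Keeping only the first two coefficients collapses each half of the signal to its mean, which is roughly what a "summary" at the coarsest scale would be.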