# SCAF - Example

This notebook gives an example of SCAF. We walk through training the word embedding models, building the time series model, and running the change detection.

#### Setup 
We use the small [LeeCorpus](http://faculty.sites.uci.edu/mdlee/similarity-data/) with all special characters, punctuation, and numbers removed. We duplicated the corpus 10 times and perturbed the last 5 copies by replacing every second occurrence of the word "the" with "in". Afterwards, we converted the full text to the [Google Ngram format](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html).
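
The preprocessing script itself is not part of this notebook. As a rough sketch of the perturbation and conversion described above, the two helpers below (`replace_every_second` and `to_ngram_lines`, both hypothetical names) replace every second occurrence of a word and emit tab-separated lines of the form `ngram TAB year TAB match_count TAB volume_count`. The choice of 5-grams, the `year` value, and a `volume_count` of 1 are assumptions made for illustration.

```python
from collections import Counter


def replace_every_second(tokens, target="the", replacement="in"):
    """Replace the 2nd, 4th, 6th, ... occurrence of `target` with `replacement`."""
    out, seen = [], 0
    for tok in tokens:
        if tok == target:
            seen += 1
            if seen % 2 == 0:
                tok = replacement
        out.append(tok)
    return out


def to_ngram_lines(tokens, n=5, year=1):
    """Count n-grams and format them as 'ngram<TAB>year<TAB>match_count<TAB>volume_count'."""
    counts = Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Assumption: the toy corpus is treated as a single volume, so volume_count is 1.
    return ["{}\t{}\t{}\t1".format(gram, year, c) for gram, c in sorted(counts.items())]


tokens = "the cat saw the dog and the bird near the tree".split()
perturbed = replace_every_second(tokens)
print(" ".join(perturbed))  # → the cat saw in dog and the bird near in tree
lines = to_ngram_lines(perturbed, n=5)
```

Replacing only every second occurrence keeps "the" in the vocabulary while shifting part of its contexts onto "in", which is what makes "in" the word expected to change.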

In [1]:
from __future__ import print_function

import os
import shutil
import pandas as pd
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from scaf.data import DataStore
from scaf.jobs import Training, BuildTimeseries, ChangeDetectionJob

2019-03-20 22:21:37,307 : INFO : 'pattern' package not found; tag filters are not available for English


Specify paths to corpus and output directory.

In [2]:
file_path = '../scaf/tests/test_data'
# Original corpus files, ngrams and frequency files
data_file = lambda file_name: os.path.join(file_path, file_name)
orig_corpus_files = ['lee.ngrams', 'lee_modified.ngrams']
orig_corpus_freq_files = ['lee.freq', 'lee_modified.freq']
# Learned corpora
corpus_path = os.path.join(file_path, 'corpus')
corpus_file = lambda file_name: os.path.join(corpus_path, file_name)
# Output directory for embeddings, store and change detection
output_path = os.path.join(file_path, 'output')
output_file = lambda file_name: os.path.join(output_path, file_name)

# Modified word
CHANGED_WORD = 'in'

Clean previous files

In [3]:
for f in [corpus_path, output_path]:
    if os.path.exists(f):
        shutil.rmtree(f)
    os.makedirs(f)

---

## Embedding training

We learn 10 embedding models, one for each of the time periods $1, 2, 3, \dots, 10$.

In [4]:
embedding_config = {
    'input': '',
    'output': '',
    'corpus_building_mode': 'ignore',
    'gensim_params': {
        'size': 25,
        'sg': 1,
        'negative': 5
    }
}

In [5]:
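def train_model(corpus):
    # Use the function argument rather than the global loop variable `i`
    embedding_config['input'] = corpus_file(corpus)
    embedding_config['output_path'] = output_path
    t = Training(embedding_config)
    t.execute()
    # Path of the trained model, e.g. <output_path>/1/1_model
    return os.path.join(output_file(corpus), '{}_model'.format(corpus))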

In [6]:
models = []
# 5 times cleaned lee corpus
for i in range(1, 6):
    f = corpus_file('{}'.format(i))
    shutil.copyfile(data_file(orig_corpus_files[0]), f)
    model_file = train_model('{}'.format(i))
    models.append(model_file)
# 5 times perturbed and cleaned lee corpus
for i in range(6, 11):
    f = corpus_file('{}'.format(i))
    shutil.copyfile(data_file(orig_corpus_files[1]), f)
    model_file = train_model('{}'.format(i))
    models.append(model_file)

2019-03-20 22:21:37,565 : INFO : [TRAIN] Start building corpus.
2019-03-20 22:21:37,567 : INFO : collecting all words and their counts
2019-03-20 22:21:37,571 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-03-20 22:21:37,584 : INFO : collected 1580 word types from a corpus of 18540 raw words and 3708 sentences
2019-03-20 22:21:37,586 : INFO : Loading a fresh vocabulary
2019-03-20 22:21:37,590 : INFO : min_count=1 retains 1580 unique words (100% of original 1580, drops 0)
2019-03-20 22:21:37,592 : INFO : min_count=1 leaves 18540 word corpus (100% of original 18540, drops 0)
2019-03-20 22:21:37,602 : INFO : deleting the raw counts dictionary of 1580 items
2019-03-20 22:21:37,603 : INFO : sample=1e-05 downsamples 1580 most-common words
2019-03-20 22:21:37,603 : INFO : downsampling leaves estimated 2226 word corpus (12.0% of prior 18540)
2019-03-20 22:21:37,605 : INFO : estimated required memory for 1580 words and 25 dimensions: 1106000 bytes

2019-03-20 22:21:37,950 : INFO : min_count=1 retains 1580 unique words (100% of original 1580, drops 0)
2019-03-20 22:21:37,951 : INFO : min_count=1 leaves 18540 word corpus (100% of original 18540, drops 0)
2019-03-20 22:21:37,963 : INFO : deleting the raw counts dictionary of 1580 items
2019-03-20 22:21:37,964 : INFO : sample=1e-05 downsamples 1580 most-common words
2019-03-20 22:21:37,964 : INFO : downsampling leaves estimated 2226 word corpus (12.0% of prior 18540)
2019-03-20 22:21:37,965 : INFO : estimated required memory for 1580 words and 25 dimensions: 1106000 bytes
2019-03-20 22:21:37,972 : INFO : resetting layer weights
2019-03-20 22:21:38,000 : INFO : [TRAIN] Training epoch 1.
2019-03-20 22:21:38,001 : INFO : training model with 8 workers on 1580 vocabulary and 25 features, using sg=1 hs=0 sample=1e-05 negative=5 window=4
2019-03-20 22:21:38,029 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-03-20 22:21:38,031 : INFO : worker thread finished; awaitin

2019-03-20 22:21:38,369 : INFO : resetting layer weights
2019-03-20 22:21:38,393 : INFO : [TRAIN] Training epoch 1.
2019-03-20 22:21:38,394 : INFO : training model with 8 workers on 1580 vocabulary and 25 features, using sg=1 hs=0 sample=1e-05 negative=5 window=4
2019-03-20 22:21:38,414 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-03-20 22:21:38,421 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-03-20 22:21:38,422 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-03-20 22:21:38,422 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-03-20 22:21:38,423 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-03-20 22:21:38,424 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-03-20 22:21:38,424 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-03-20 22:21:38,425 : INFO : worker thread finished; awaiting finish of 0 more threads

2019-03-20 22:21:38,795 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-03-20 22:21:38,796 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-03-20 22:21:38,796 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-03-20 22:21:38,797 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-03-20 22:21:38,798 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-03-20 22:21:38,798 : INFO : training on 18540 raw words (2213 effective words) took 0.0s, 102148 effective words/s
2019-03-20 22:21:38,800 : INFO : [TRAIN] Training epoch 2.
2019-03-20 22:21:38,800 : INFO : training model with 8 workers on 1580 vocabulary and 25 features, using sg=1 hs=0 sample=1e-05 negative=5 window=4
2019-03-20 22:21:38,858 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-03-20 22:21:38,859 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-03-20 22:21:38,860 : INFO : wor

2019-03-20 22:21:39,209 : INFO : [TRAIN] Training epoch 2.
2019-03-20 22:21:39,209 : INFO : training model with 8 workers on 1580 vocabulary and 25 features, using sg=1 hs=0 sample=1e-05 negative=5 window=4
2019-03-20 22:21:39,231 : INFO : worker thread finished; awaiting finish of 7 more threads
2019-03-20 22:21:39,233 : INFO : worker thread finished; awaiting finish of 6 more threads
2019-03-20 22:21:39,234 : INFO : worker thread finished; awaiting finish of 5 more threads
2019-03-20 22:21:39,235 : INFO : worker thread finished; awaiting finish of 4 more threads
2019-03-20 22:21:39,236 : INFO : worker thread finished; awaiting finish of 3 more threads
2019-03-20 22:21:39,236 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-03-20 22:21:39,237 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-03-20 22:21:39,238 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-03-20 22:21:39,239 : INFO : training on 18540 raw words (217

---

## Build store

Now we combine the word embedding information with frequency information and build a store.

In [7]:
models

['../scaf/tests/test_data/output/1/1_model',
 '../scaf/tests/test_data/output/2/2_model',
 '../scaf/tests/test_data/output/3/3_model',
 '../scaf/tests/test_data/output/4/4_model',
 '../scaf/tests/test_data/output/5/5_model',
 '../scaf/tests/test_data/output/6/6_model',
 '../scaf/tests/test_data/output/7/7_model',
 '../scaf/tests/test_data/output/8/8_model',
 '../scaf/tests/test_data/output/9/9_model',
 '../scaf/tests/test_data/output/10/10_model']

We align the embeddings with Procrustes analysis. Then we build the similarity time series by computing, for each word, the cosine similarity between its embeddings in consecutive time periods.
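
To make the idea concrete, here is a minimal NumPy sketch of orthogonal Procrustes alignment followed by a row-wise cosine similarity. It is not SCAF's implementation; the function names and the random toy matrices are assumptions for illustration only. The second "embedding" is just a rotated copy of the first, so after alignment every word should have a similarity of about 1.

```python
import numpy as np


def procrustes_align(base, other):
    """Rotate `other` onto `base` using the orthogonal Procrustes solution.

    Both matrices have shape (vocab_size, dim) and share the same row order.
    """
    u, _, vt = np.linalg.svd(other.T.dot(base))
    return other.dot(u.dot(vt))


def cosine_similarity_rows(a, b):
    """Cosine similarity between corresponding rows of `a` and `b`."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den


rng = np.random.RandomState(0)
emb_t1 = rng.normal(size=(6, 4))              # toy embedding for period t
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
emb_t2 = emb_t1.dot(q)                        # same geometry, different basis

aligned = procrustes_align(emb_t1, emb_t2)
sims = cosine_similarity_rows(emb_t1, aligned)
print(np.round(sims, 6))
```

Without the alignment step, the similarities between `emb_t1` and `emb_t2` would mostly reflect the arbitrary rotation between independently trained models rather than any change in word usage.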

In [8]:
b = BuildTimeseries(models, output_file=output_file('sim.ts'), alignment_mode='procrustes')
b.execute()

2019-03-20 22:21:39,518 : INFO : [BUILD_TS] Starting building time series.
2019-03-20 22:21:39,519 : INFO : [BUILD_TS] Loading embeddings.
2019-03-20 22:21:39,958 : INFO : [BUILD_TS] Finished loading embeddings.
2019-03-20 22:21:39,960 : INFO : [BUILD_TS] Starting alignment.
2019-03-20 22:21:39,960 : INFO : [ALIGN] Starting alignment with mode procrustes
2019-03-20 22:21:39,971 : INFO : [ALIGN] Status 0.0%
2019-03-20 22:21:39,985 : INFO : [ALIGN] Status 11.11111111111111%
2019-03-20 22:21:39,996 : INFO : [ALIGN] Status 22.22222222222222%
2019-03-20 22:21:40,010 : INFO : [ALIGN] Status 33.33333333333333%
2019-03-20 22:21:40,022 : INFO : [ALIGN] Status 44.44444444444444%
2019-03-20 22:21:40,033 : INFO : [ALIGN] Status 55.55555555555556%
2019-03-20 22:21:40,044 : INFO : [ALIGN] Status 66.66666666666666%
2019-03-20 22:21:40,056 : INFO : [ALIGN] Status 77.77777777777779%
2019-03-20 22:21:40,071 : INFO : [ALIGN] Status 88.88888888888889%
2019-03-20 22:21:40,094 : INFO : [ALIGN] Status 100%

Build frequency time series from original frequency files

In [9]:
original = pd.read_csv(data_file(orig_corpus_freq_files[0]), sep='\t',
                       names=('word', 'year', 'match_count', 'volume_count'), quoting=3)
modified = pd.read_csv(data_file(orig_corpus_freq_files[1]), sep='\t',
                       names=('word', 'year', 'match_count', 'volume_count'), quoting=3)
merged = original[['word', 'match_count']].join(modified[['word', 'match_count']].set_index('word'),
                                                on='word', rsuffix='_b')
merged['word_type'] = 'X'
for i in range(1, 6):
    merged[str(i)] = merged['match_count']
for i in range(6, 11):
    merged[str(i)] = merged['match_count_b']
del merged['match_count']
del merged['match_count_b']
merged.to_csv(output_file('freq.ts'), index=False, quoting=3, header=None)

# Print series for the unperturbed word 'a' and the perturbed word 'in'
print(merged[merged['word'] == 'a'])
print(merged[merged['word'] == 'in'])

    word word_type   1   2   3   4   5   6   7   8   9  10
784    a         X  84  84  84  84  84  84  84  84  84  84
    word word_type   1   2   3   4   5    6    7    8    9   10
953   in         X  93  93  93  93  93  247  247  247  247  247


Finally build data store

In [10]:
store = DataStore()
store.load_data(output_file('sim.ts'), output_file('freq.ts'))
store.to_file(output_file('sgns_procrustes_0.5.store'))

2019-03-20 22:21:40,327 : INFO : Start bulding temp dict
2019-03-20 22:21:40,759 : INFO : Finished building temp dict
2019-03-20 22:21:40,761 : INFO : Start building final store
2019-03-20 22:21:41,149 : INFO : Finished building final store


In [11]:
store['a']

array([[nan, 1.000000000000001, 1.000000000000001, 1.000000000000001,
        1.000000000000001, 0.9999952246766742, 0.999999999999999,
        0.999999999999999, 0.999999999999999, 0.9999999999999992],
       [84, 84, 84, 84, 84, 84, 84, 84, 84, 84]], dtype=object)

In [12]:
store['in']

array([[nan, 1.0000000000000018, 1.0000000000000018, 1.0000000000000018,
        1.0000000000000018, 0.9999903644887552, 0.999999999999999,
        0.999999999999999, 0.999999999999999, 0.9999999999999994],
       [93, 93, 93, 93, 93, 247, 247, 247, 247, 247]], dtype=object)

The embedding of the unperturbed word 'a' is very stable, i.e., its similarity values stay high and its frequency does not change.
The embedding similarity for the perturbed word 'in' dips at time period 6 (0.9999904, compared with 0.9999952 for 'a'), and its frequency increases from 93 to 247.

---

## Change detection

Finally run the change detection on the built time series store.

In [13]:
change_detection_config = {
    'model_file': output_file('sgns_procrustes_0.5.store'),
    'tp_file': data_file('changed_vocab'),
    'output_file': output_file('result.cd.eval'),
    'cd_method': 'cusum_2d',
    'store_transformations': {
        'measure': 'padcosdist',
        'percentual': 'True',
        'normalize': 'False'
    },
    'eval_mode': 'full',
    'store_rank_list': True
}

In [14]:
job = ChangeDetectionJob(change_detection_config)
job.execute()

2019-03-20 22:21:41,222 : INFO : [CDJ] Prepare store.
2019-03-20 22:21:43,328 : INFO : [CDJ] Finished preparing store.
2019-03-20 22:21:43,329 : INFO : [CDJ] Init change detection.
2019-03-20 22:21:43,330 : INFO : [CDJ] Starting change detection.
2019-03-20 22:21:43,331 : INFO : [CDJ] Status 000.0000%
2019-03-20 22:21:43,352 : INFO : [CDJ] Status 063.2911%
2019-03-20 22:21:43,366 : INFO : [CDJ] Status 100.0000%
2019-03-20 22:21:43,383 : INFO : [CDJ] Finished change detection.
2019-03-20 22:21:43,385 : INFO : [CDJ] Storing result.
2019-03-20 22:21:43,392 : INFO : [CDJ] Finished storing result.


Inspect detection

In [15]:
df = pd.read_csv(output_file('result.ranked'))
df.head(n=10)

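| Unnamed: 0 | word | time | score | rank |
|---|---|---|---|---|
| 0 | in | 6 | 1.655924 | 1.0 |
| 1 | listened | 6 | 2.6e-05 | 2.0 |
| 2 | new | 6 | 2e-05 | 3.0 |
| 3 | unregistered | 6 | 1.9e-05 | 4.0 |
| 4 | called | 6 | 1.4e-05 | 5.0 |
| 5 | of | 6 | 1.2e-05 | 6.0 |
| 6 | georgian | 6 | 1.2e-05 | 7.0 |
| 7 | approximately | 6 | 1.1e-05 | 8.0 |
| 8 | and | 6 | 1e-05 | 9.0 |
| 9 | biased | 6 | 1e-05 | 10.0 |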


SCAF correctly identifies time period 6 as the point at which 'in' changed semantically. The scores also quantify the magnitude of the shift: 'in' scores 1.655924, whereas the next-ranked word, 'listened', scores only 2.6e-05. All other words have changed only slightly.
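
For intuition about what a CUSUM-style detector looks for, the sketch below applies a basic one-dimensional CUSUM to the frequency series printed earlier. This is only an illustration: it is not the `cusum_2d` method configured above, which also takes the embedding similarities into account, and the function name `cusum_change_point` is hypothetical.

```python
import numpy as np


def cusum_change_point(series):
    """Return (period, score) for the most likely mean shift in `series`.

    `period` is the 1-based time period at which the new level starts.
    `score` is the range of the cumulative sum of deviations from the mean;
    it is 0 for a constant series, in which case `period` is meaningless.
    """
    x = np.asarray(series, dtype=float)
    s = np.cumsum(x - x.mean())
    last_before_change = int(np.argmax(np.abs(s)))  # 0-based index
    score = float(s.max() - s.min())
    return last_before_change + 2, score


freq_in = [93] * 5 + [247] * 5  # frequency series of 'in' from the store
freq_a = [84] * 10              # frequency series of 'a' from the store

print(cusum_change_point(freq_in))    # → (6, 385.0)
print(cusum_change_point(freq_a)[1])  # → 0.0
```

The cumulative sum drifts steadily away from zero while the series sits below its overall mean and turns back once the level shifts, so its extreme point marks the last period before the change.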