# Evaluate Baseline Embeddings

Rewrite the steps in Levy's evaluate_bible.sh and evaluate_europarl.sh in Python, so that we can analyse the sampling distribution of accuracies resulting from multiple randomly seeded embeddings.

In [None]:
%run ./evaluation_lib.ipynb

In [None]:
vecs_dir_prefix = 'sgns_baseline'
training_corpus = 'bible'
results_spreadsheet = training_corpus + '_eval.xlsx'
precision_at_N = 10

In [None]:
sample_dist = evaluate_sample_distribution(training_corpus, vecs_dir_prefix, lang_list=['es','fi','fr'], precision_at_N=precision_at_N)

In [None]:
df_results = pd.DataFrame(sample_dist)

In [None]:
sample_mean = df_results.mean()
sample_mean
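The shape of the data in the cells above can be pictured with a toy example. This is only a sketch: it assumes `evaluate_sample_distribution` returns one record per randomly seeded embedding, mapping benchmark name to precision, which is not confirmed here since the function lives in `evaluation_lib.ipynb`. All `toy_*` names and values are placeholders, not real results.

```python
import pandas as pd

# Hypothetical per-seed results: one dict per randomly seeded embedding,
# mapping benchmark name -> precision. Placeholder values, not real results.
toy_sample_dist = [
    {'wiktionary-en-es': 0.30, 'wiktionary-es-en': 0.40},
    {'wiktionary-en-es': 0.32, 'wiktionary-es-en': 0.38},
    {'wiktionary-en-es': 0.34, 'wiktionary-es-en': 0.42},
]

toy_df = pd.DataFrame(toy_sample_dist)   # rows = seeds, columns = benchmarks
toy_mean = toy_df.mean()                 # per-benchmark sample mean
toy_std = toy_df.std(ddof=1)             # per-benchmark sample standard deviation
toy_sem = toy_std / len(toy_df) ** 0.5   # standard error of the mean
```

The spread across seeds (`toy_std`, `toy_sem`) is what makes a significance test against a published point value possible.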

### Significance Testing
Perform a two-tailed, one-sample Student's t-test at the 5% significance level, comparing the sample of test results against the mean of the true population of possible results for each evaluation benchmark. Here, the true population is the infinite set of all test results that could be collected for these benchmarks.
<br><br>
\begin{equation}
\begin{aligned}
H_0&: \mu_{pop} = \mu_{Levy} \\
H_a&: \mu_{pop} \neq \mu_{Levy}
\end{aligned}
\end{equation}
i.e. <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;H<sub>0</sub>: The published results of Levy et al.'s experiments are a good approximation of the true population mean<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;H<sub>a</sub>: Either the published results of Levy et al.'s experiments are not a good approximation of the true population mean, or my recreation of their experiment differs in some significant way
<br>

\begin{equation}
\text{p-value} = P(\text{observed experimental results or a more extreme outcome } | H_0 \text{ true})
\end{equation}
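For a single benchmark, the test described above can be sketched with SciPy's `ttest_1samp`. This is a minimal illustration, not the `calc_t_test` helper from `evaluation_lib.ipynb`, whose implementation is not shown here. The function name `one_sample_t_test` and the sample accuracies are hypothetical; only the reference value 0.3509 is taken from the `wiktionary-en-es` entry used in this notebook.

```python
from scipy import stats

def one_sample_t_test(samples, mu_0, alpha=0.05):
    """Two-tailed one-sample t-test of H0: population mean == mu_0."""
    # ttest_1samp is two-sided by default and uses n - 1 degrees of freedom.
    result = stats.ttest_1samp(samples, popmean=mu_0)
    return {
        't_stat': float(result.statistic),
        'p_value': float(result.pvalue),
        'reject_h0': bool(result.pvalue < alpha),
    }

# Placeholder accuracies from five hypothetical seeds (not real results).
toy_accuracies = [0.31, 0.33, 0.32, 0.34, 0.30]
toy_outcome = one_sample_t_test(toy_accuracies, mu_0=0.3509)
```

When `reject_h0` is true, the sample mean differs from the published value by more than seed-to-seed variation would explain at the chosen significance level.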

In [None]:
# Taken from Levy et al. (2017), Table 3, results column "Multilingual SID-SGNS".
# The "Multilingual SID-SGNS" approach seems to be the one implemented in the sample code.

levy_mu = {
 'cakmak-en-tr': 0.2404,
 'cakmak-tr-en': 0.2945,
 'graca-en-es': 0.4893,
 'graca-en-fr': 0.4433,
 'graca-en-pt': 0.4047,
 'graca-es-en': 0.5015,
 'graca-fr-en': 0.4632,
 'graca-pt-en': 0.4151,
 'hansards-en-fr': 0.4091,
 'hansards-fr-en': 0.4302,
 'holmqvist-en-sv': 0.2737,
 'holmqvist-sv-en': 0.3195,
 'lambert-en-es': 0.2989,
 'lambert-es-en': 0.3049,
 'mihalcea-en-ro': 0.2514,
 'mihalcea-ro-en': 0.2753,
 'wiktionary-ar-en': 0.3082,
 'wiktionary-en-ar': 0.1605,
 'wiktionary-en-es': 0.3509,
 'wiktionary-en-fi': 0.1591,
 'wiktionary-en-fr': 0.3304,
 'wiktionary-en-he': 0.1448,
 'wiktionary-en-hu': 0.2482,
 'wiktionary-en-pt': 0.4058,
 'wiktionary-en-tr': 0.2437,
 'wiktionary-es-en': 0.3868,
 'wiktionary-fi-en': 0.2584,
 'wiktionary-fr-en': 0.3893,
 'wiktionary-he-en': 0.2403,
 'wiktionary-hu-en': 0.3372,
 'wiktionary-pt-en': 0.4376,
 'wiktionary-tr-en': 0.3080
}

In [None]:
df_sig_results = calc_t_test(df_results, levy_mu)

In [None]:
df_sig_results

In [None]:
df_sig_results.to_pickle('./' + training_corpus + '/df_sig_results.pickle')

In [None]:
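# Use a context manager so the workbook is saved and closed automatically
# (ExcelWriter.save() was removed in pandas 2.0).
with pd.ExcelWriter(results_spreadsheet) as writer:
    df_sig_results.to_excel(writer, sheet_name=vecs_dir_prefix)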

To retrieve previously saved baseline results from file, execute the next block.

In [None]:
df_sig_results = pd.read_pickle('./' + training_corpus + '/df_sig_results.pickle')

In [None]:
# Keep only the Wiktionary benchmarks for the evaluated language pairs.
df_filtered_sig_results = df_sig_results.filter(
    ['wiktionary-en-es',
     'wiktionary-en-fi',
     'wiktionary-en-fr',
     'wiktionary-es-en',
     'wiktionary-fi-en',
     'wiktionary-fr-en'],
    axis=1)
# Shorten the column names for the LaTeX table.
df_filtered_sig_results.columns = ['en-es', 'en-fi', 'en-fr', 'es-en', 'fi-en', 'fr-en']
df_filtered_sig_results = df_filtered_sig_results.round(3)

In [None]:
print(df_filtered_sig_results.to_latex())