# Demo for single words as targets

<br></br>

<span style="color:red">Before you run the demo, make sure to follow the steps from the README.md file.</span>

<span style="color:red">If you want to learn more about the underlying implementation, call Python's built-in help function on the relevant class or method (e.g., help(LcpRegressor)).</span>

***

Load the necessary libraries, and (optionally) set the cache folder for the context-dependent models (i.e., Hugging Face transformers).

In [None]:
#import os
#os.environ['TRANSFORMERS_CACHE'] = <new_cache_folder_path>

import csv
import time
  
import numpy as np
import pandas as pd

from sources.final_model import LcpRegressor
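
If you do want to redirect the cache, the commented-out lines above can be expanded along the following lines. This is a minimal sketch: the helper name `set_transformers_cache` and the example folder are hypothetical, and the variable should be set before the transformers library is imported. Recent releases of the library also recognise `HF_HOME`, so check the documentation for the version you have installed.

```python
import os
import tempfile
from pathlib import Path


def set_transformers_cache(cache_dir):
    """Create cache_dir if needed and point TRANSFORMERS_CACHE at it."""
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ['TRANSFORMERS_CACHE'] = str(path)
    return path


# Hypothetical location, used here only for illustration.
example_dir = Path(tempfile.gettempdir()) / 'lcp_transformers_cache'
cache_path = set_transformers_cache(example_dir)
print(os.environ['TRANSFORMERS_CACHE'])
```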

Initialize the ridge regression model (i.e., LcpRegressor), and specify that the targets consist of single words (i.e., use_single_words=True). Apply strong regularization (i.e., lambda_param=1200), and run the model in verbose mode (i.e., verbose=True), which helps you detect potential bottlenecks.

In [None]:
curr_model = LcpRegressor(use_single_words=True, lambda_param=1200, verbose=True)
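
To get a feel for what a large regularization parameter does, the sketch below fits a textbook closed-form ridge regression on synthetic data with a weak and a strong penalty. It is not the LcpRegressor implementation: the function `ridge_weights`, the toy data, and the absence of an intercept are all assumptions made for illustration.

```python
import numpy as np


def ridge_weights(X, y, lambda_param):
    """Closed-form ridge solution: (X^T X + lambda * I)^-1 X^T y."""
    n_features = X.shape[1]
    gram = X.T @ X + lambda_param * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)


# Synthetic data: 200 observations, 5 predictors, known weights plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w_weak = ridge_weights(X, y, lambda_param=1e-3)
w_strong = ridge_weights(X, y, lambda_param=1200)

# The stronger penalty shrinks the weight vector towards zero.
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

With many correlated predictors (such as embedding dimensions), this shrinkage is what keeps the fit from chasing noise in the training set.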

Process all the behavioural norms and distributional models. Alternatively, process only a subset, for instance by excluding norms that have low predictive power and/or take too long to load. You must assign a name to each norm/model; you will use this name later to generate predictors based on that norm/model.

Load the behavioural norms from file. If you do not plan to use norms at all, you can skip this step.

In [None]:
behav_norm_names = ['Conc', 'SemD', 'Freq_CD', 'Prev', 'AoA', 'Emo', 'SensMot', 'Comp', 'MRC', 'LD']
behav_norm_filenames = ['Concreteness norms.txt', 
                        'Semantic diversity norms.txt',
                        'Frequency and contextual diversity norms.txt',
                        'Prevalence norms.txt',
                        'Age of acquisition norms.txt',
                        'Emotional norms.txt',
                        'Sensorimotor norms.txt',
                        'Complexity norms.txt',
                        'MRC norms.txt', 
                        'Lexical decision norms.txt']

curr_model.load_behav_norms(behav_norm_names, behav_norm_filenames)
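
Because names and filenames are passed as two parallel lists, a mismatch in length or order is easy to introduce when you trim the lists. The sketch below shows one way to check them beforehand. The helper `pair_names_with_files` and its `folder` argument are hypothetical; where the package actually looks for the files is not covered here.

```python
from pathlib import Path


def pair_names_with_files(names, filenames, folder='.'):
    """Pair each name with its file and report files missing from folder."""
    if len(names) != len(filenames):
        raise ValueError('names and filenames must have the same length')
    if len(set(names)) != len(names):
        raise ValueError('names must be unique')
    pairs = dict(zip(names, filenames))
    missing = [f for f in filenames if not (Path(folder) / f).is_file()]
    return pairs, missing


# The folder below is a placeholder that is not expected to exist.
pairs, missing = pair_names_with_files(
    ['Conc', 'AoA'],
    ['Concreteness norms.txt', 'Age of acquisition norms.txt'],
    folder='hypothetical_norms_folder',
)
print(pairs)
print('Missing files:', missing)
```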

Load the context-independent models from file. If you do not plan to use context-independent models at all, you can skip this step.

In [None]:
cont_indep_model_names = ['Skip-gram', 'GloVe', 'NumberBatch']
cont_indep_model_filenames = ['Skip-gram embeddings.txt',
                              'GloVe embeddings.txt',
                              'ConceptNet NumberBatch embeddings.txt']

curr_model.load_cont_indep_models(cont_indep_model_names, cont_indep_model_filenames)

Load the (pre-trained) context-dependent models, using the Hugging Face library. The classes of models (i.e., transformers) currently supported by our implementation are 'albert', 'bert', 'deberta', 'electra', and 'roberta'. Each class has one or more available models (e.g., in the case of BERT, valid ids are 'bert-base-uncased', 'bert-base-cased', 'bert-large-cased', etc.; you can find the full list at https://huggingface.co/models).

In [None]:
cont_dep_model_names = ['Albert', 'Bert', 'Deberta', 'Electra_small', 'Electra_base', 'Electra_large', 'Roberta']
cont_dep_model_ids = ['albert-base-v2', 
                      'bert-base-uncased', 
                      'microsoft/deberta-base', 
                      'google/electra-small-discriminator',
                      'google/electra-base-discriminator',
                      'google/electra-large-discriminator',
                      'roberta-base']

curr_model.load_cont_dep_models(cont_dep_model_names, cont_dep_model_ids)
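
Before downloading anything, you may want to confirm that each id belongs to one of the supported classes. The sketch below uses a naive prefix match on the part of the id after the organisation prefix. The helper `guess_model_class` is hypothetical, and this heuristic is an assumption for illustration, not how the package or Hugging Face determines the architecture.

```python
SUPPORTED_CLASSES = ('albert', 'bert', 'deberta', 'electra', 'roberta')


def guess_model_class(model_id):
    """Return the supported class whose name starts the model id, or None."""
    # Drop an optional organisation prefix such as 'google/' or 'microsoft/'.
    short_name = model_id.split('/')[-1].lower()
    for model_class in SUPPORTED_CLASSES:
        if short_name.startswith(model_class):
            return model_class
    return None


model_ids = ['albert-base-v2',
             'bert-base-uncased',
             'microsoft/deberta-base',
             'google/electra-small-discriminator',
             'roberta-base',
             'gpt2']

for model_id in model_ids:
    print(model_id, '->', guess_model_class(model_id))
```

Matching on the start of the name, rather than on any substring, avoids classifying 'roberta-base' or 'albert-base-v2' as 'bert'.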

Based on the previously loaded norms/models and their corresponding names, select one or more types of predictors that will be used in fitting the complexity ratings. 

In [None]:
pred_names = ['Conc', 'SemD', 'Freq_CD', 'Prev', 'AoA', 'Emo', 'SensMot', 'Comp', 'MRC', 'LD',
              'Skip-gram', 'GloVe', 'NumberBatch',
              'Albert', 'Bert', 'Deberta', 'Electra_small', 'Electra_base', 'Electra_large', 'Roberta']

curr_model.select_preds(pred_names)
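
Every predictor name has to match a name assigned during loading, so a typo such as 'Glove' instead of 'GloVe' is worth catching early. The sketch below is one way to do that; the helper `find_unknown_preds` is hypothetical, and how select_preds itself reacts to an unknown name is not documented here.

```python
def find_unknown_preds(pred_names, *loaded_name_lists):
    """Return the predictor names that were never assigned during loading."""
    loaded = set().union(*loaded_name_lists)
    return [name for name in pred_names if name not in loaded]


loaded_norms = ['Conc', 'AoA']
loaded_cont_indep = ['Skip-gram', 'GloVe']
loaded_cont_dep = ['Bert']

# 'Glove' is a deliberate typo for 'GloVe'.
unknown = find_unknown_preds(['Conc', 'Glove', 'Bert'],
                             loaded_norms, loaded_cont_indep, loaded_cont_dep)
print('Unknown predictor names:', unknown)
```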

Read the train and test datasets from file. As with the norms and models, you are free to provide your own set of stimuli, as long as they follow the format used by the organizers of the Lexical Complexity Prediction (LCP) shared task.

In [None]:
stimuli_train = pd.read_csv('./stimuli/lcp_single_train.tsv', delimiter='\t', quoting=csv.QUOTE_NONE, na_filter=False)
y_train = stimuli_train['complexity']
X_train = stimuli_train.drop(['complexity'], axis=1)

stimuli_test = pd.read_csv('./stimuli/lcp_single_test.tsv', delimiter='\t', quoting=csv.QUOTE_NONE, na_filter=False)
y_test = stimuli_test['complexity']
X_test = stimuli_test.drop(['complexity'], axis=1)
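
The two non-default arguments passed to read_csv matter for text data. The sketch below runs the same call on a small in-memory table to show why. The rows are made up, and the column layout (id, corpus, sentence, token, complexity) is an assumption about the LCP format.

```python
import csv
import io

import pandas as pd

# Toy data: one sentence starts with an unbalanced double quote, and the
# tokens 'null' and 'NA' would normally be read as missing values.
tsv_text = (
    'id\tcorpus\tsentence\ttoken\tcomplexity\n'
    'A1\tbible\t"Behold the null hypothesis\tnull\t0.25\n'
    'A2\teuroparl\tNo data was available\tNA\t0.10\n'
)

stimuli = pd.read_csv(io.StringIO(tsv_text), delimiter='\t',
                      quoting=csv.QUOTE_NONE, na_filter=False)
y = stimuli['complexity']
X = stimuli.drop(['complexity'], axis=1)

# Both tokens survive as ordinary strings, and the stray quote is kept.
print(X['token'].tolist())
print(X.loc[0, 'sentence'])
```

With quoting=csv.QUOTE_NONE, quotation marks are treated as ordinary characters, so a stray quote cannot swallow the following fields. With na_filter=False, words such as 'null' or 'NA' are not converted to missing values.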

Fit the model to the train dataset.  

In [None]:
pred_list = curr_model.fit(X_train, y_train)  
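
To complement the verbose output with wall-clock timings, you can wrap any step, such as the fit call above, in a small timer. This is a minimal sketch: the context manager `timed` is hypothetical, and the toy workload only stands in for the real call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Measure the wall-clock time of the enclosed block and print it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print('{}: {:.2f} s'.format(label, elapsed))


timings = {}
with timed('toy workload', timings):
    # Stand-in for a slow step, e.g. loading a model or fitting.
    total = sum(i * i for i in range(100000))
```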

Finally, evaluate model performance on the test dataset, using the Pearson and Spearman correlations between the predicted and the observed complexity ratings.

In [None]:
pearson_corr, spearman_corr = curr_model.score(X_test, y_test)

print('Pearson correlation (test set): {:.2f}'.format(pearson_corr))
print('Spearman correlation (test set): {:.2f}'.format(spearman_corr))
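
The two metrics answer different questions: Pearson measures linear agreement, whereas Spearman depends only on the rank order. The sketch below computes both with NumPy and pandas on made-up values. The helper `pearson_and_spearman` is hypothetical and is not how the score method is implemented.

```python
import numpy as np
import pandas as pd


def pearson_and_spearman(y_true, y_pred):
    """Return (Pearson, Spearman); Spearman is Pearson applied to ranks."""
    y_true = pd.Series(y_true, dtype=float)
    y_pred = pd.Series(y_pred, dtype=float)
    pearson = np.corrcoef(y_true.to_numpy(), y_pred.to_numpy())[0, 1]
    # rank() assigns average ranks to ties.
    spearman = np.corrcoef(y_true.rank().to_numpy(),
                           y_pred.rank().to_numpy())[0, 1]
    return pearson, spearman


# Made-up ratings: the predictions preserve the order but not the spacing.
toy_true = [0.1, 0.2, 0.3, 0.4]
toy_pred = [0.1, 0.2, 0.4, 0.8]

toy_pearson, toy_spearman = pearson_and_spearman(toy_true, toy_pred)
print('Pearson: {:.2f}'.format(toy_pearson))
print('Spearman: {:.2f}'.format(toy_spearman))
```

Here the order is preserved perfectly, so the Spearman correlation is at its maximum, while the uneven spacing pulls the Pearson correlation below it.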