# Demo for Italian stimuli

<br></br>

<span style="color:red">Before you run the demo, make sure to follow the steps from the README.md file.</span>

<span style="color:red">If you want to learn more about the underlying implementation, use the help command.</span>

***

Load the necessary libraries, and (optionally) set the cache folder for the context-dependent models (i.e., Hugging Face transformers).

In [None]:
#import os
#os.environ['TRANSFORMERS_CACHE'] = <new_cache_folder_path>

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

from sources.final_model import ConcretextRegressor

Initialize the ridge regression model (i.e., ConcretextRegressor). Here we have two important options, namely to use only the Italian stimuli (i.e., include_translation=False), or both the Italian stimuli and their English translation (i.e., include_translation=True). Also, we enforce a strong degree of regularization (i.e., lambda_param=500), and run the model in verbose mode (i.e., verbose=True), since this allows us to detect potential bottlenecks.

In [None]:
curr_model = ConcretextRegressor(include_translation=True, lambda_param=500, verbose=True)
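
To give an intuition for what a large regularization strength does, here is a minimal, generic sketch of closed-form ridge regression in NumPy. It is not the internals of ConcretextRegressor: the helper `ridge_fit`, the synthetic data, and the absence of an intercept are all assumptions made for illustration only.

```python
import numpy as np

def ridge_fit(X, y, lambda_param):
    """Closed-form ridge solution: w = (X^T X + lambda * I)^(-1) X^T y."""
    n_features = X.shape[1]
    gram = X.T @ X + lambda_param * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

w_weak = ridge_fit(X, y, lambda_param=0.01)
w_strong = ridge_fit(X, y, lambda_param=500)

# Stronger regularization shrinks the coefficient vector towards zero
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

The larger the penalty, the smaller the norm of the fitted coefficients, which tends to help when there are many correlated predictors and relatively few stimuli.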

Process all the behavioural norms and distributional models. Alternatively, you can process only a subset of them, for instance if some norms/models have low predictive power and/or take too much time to load. You must assign a name to each norm/model, and it is this name that you will later use if you wish to generate predictors based on that particular norm/model.

Load the behavioural norms from file. If you do not plan to use norms at all, you can skip this step.

In [None]:
behav_norm_names = ['Conc', 'SemD', 'Freq_CD', 'AoA', 'Emo', 'SensMot']
behav_norm_filenames = ['Concreteness norms - English.txt', 
                       'Semantic diversity norms - English.txt',
                       'Frequency and contextual diversity norms - English.txt',
                       'Age of acquisition norms - English.txt',
                       'Emotional norms - English.txt',
                       'Sensorimotor norms - English.txt']

curr_model.load_behav_norms(behav_norm_names, behav_norm_filenames)

Load the context-independent models from file, then reduce the dimensionality of the models and (optionally) their concatenation. Alternatively, you can decide to omit the dimensionality reduction step, or reduce the dimensionality of only a subset of models. If you do not plan to use context-independent models at all, you can skip this step.

In [None]:
curr_model.load_cont_indep_models(['FastText', 'NumberBatch'], 
                                  ['FastText embeddings - Italian.txt',
                                  'ConceptNet NumberBatch embeddings - Italian.txt'],
                                  include_concat=True)

curr_model.reduce_dims_cont_indep_models(['FastText', 'NumberBatch'], include_concat=True, n_pcs=30)
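
The parameter name n_pcs suggests that the reduction is based on principal components. Assuming that is the case, the sketch below shows the general idea with scikit-learn's PCA on random stand-in matrices. The array names and sizes are hypothetical, and the actual method may differ in details such as scaling or how the concatenation is built.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_words = 200
emb_a = rng.normal(size=(n_words, 300))  # stand-in for one embedding model
emb_b = rng.normal(size=(n_words, 300))  # stand-in for a second embedding model

n_pcs = 30

# Reduce each model separately to its first n_pcs principal components
reduced_a = PCA(n_components=n_pcs).fit_transform(emb_a)
reduced_b = PCA(n_components=n_pcs).fit_transform(emb_b)

# Concatenate the original embeddings column-wise, then reduce the result
concat = np.hstack([emb_a, emb_b])
reduced_concat = PCA(n_components=n_pcs).fit_transform(concat)

print(reduced_a.shape, reduced_b.shape, reduced_concat.shape)
```

Each reduced matrix keeps one row per word and n_pcs columns, so the number of predictors passed to the regression stays small regardless of the original embedding size.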

Load the (pre-trained) context-dependent models, using the Hugging Face library. The classes of models (i.e., transformers) currently supported by our implementation are 'albert' (English), 'alberto' (Italian), 'bart' (English), 'bert' (English, multilingual), 'gpt-2' (English), and 'roberta' (English). Each class has one or more available models (e.g., in the case of BERT, valid IDs are 'bert-base-uncased', 'bert-base-cased', 'bert-large-cased', 'bert-base-multilingual-uncased', etc.; you can find the full list at https://huggingface.co/models).

In [None]:
curr_model.load_cont_dep_models(['albert', 'alberto', 'bart', 'bert', 'gpt-2', 'roberta'], 
                                ['albert-base-v2', 
                                 'm-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alberto',
                                 'facebook/bart-base', 
                                 'bert-base-uncased', 
                                 'gpt2',
                                 'roberta-base']) 

Based on the previously loaded norms/models and their corresponding names, select one or more types of predictors that will be used in fitting the concreteness ratings. For each selected norm/model, you need to specify whether the predictors should be derived from the inflected form of the target (i.e., the word from position INDEX in TEXT), and/or the uninflected one (i.e., the word from TARGET). Also, for any given norm/model, you can include both the inflected and uninflected versions of the predictors (e.g., by setting pred_names = ['MyNormOrModel', 'MyNormOrModel'], target_is_inflected = [True, False]).

Importantly, you need to specify whether you want to generate predictors from the original Italian stimuli (i.e., by using the select_preds_original() method), or from their English translation (i.e., by using the select_preds_translated() method). You can use both options, or only one of them.

In [None]:
curr_model.select_preds_original(['FastText', 'NumberBatch', 'Cont_Indep_Models_Concat', 'alberto'], 
                           [False, False, False, True])

curr_model.select_preds_translated(['Conc', 'Conc', 'SemD', 'SemD', 'Freq_CD', 'Freq_CD',
                                    'AoA', 'AoA', 'Emo', 'Emo', 'SensMot', 'SensMot',
                                    'FastText', 'albert', 'bart', 'bert', 'gpt-2', 'roberta'], 
                                   [True, False, True, False, True, False, 
                                    True, False, True, False, True, False, 
                                    True, True, True, True, True, False])
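
To make the inflected/uninflected distinction concrete, the sketch below pulls both forms out of a toy data frame. The two sentences are invented for illustration, and the helper `inflected_target` assumes that INDEX is a 0-based position over whitespace-separated tokens; check the actual stimuli file before relying on that convention.

```python
import pandas as pd

# Invented example rows using the TARGET, INDEX and TEXT columns
stimuli_example = pd.DataFrame({
    'TARGET': ['cane', 'correre'],
    'INDEX': [1, 1],
    'TEXT': ['I cani abbaiano forte', 'Loro corrono ogni mattina'],
})

def inflected_target(row):
    """Return the token found at position INDEX in TEXT (assumed 0-based)."""
    tokens = row['TEXT'].split()
    return tokens[row['INDEX']]

stimuli_example['INFLECTED'] = stimuli_example.apply(inflected_target, axis=1)

# TARGET holds the uninflected form, INFLECTED the form as it appears in context
print(stimuli_example[['TARGET', 'INFLECTED']])
```

Setting target_is_inflected to True corresponds to the INFLECTED column here, while False corresponds to TARGET.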

Read the (trial) stimuli from file. As with the norms and models, you are free to provide your own set of stimuli, as long as they follow the format employed by the organizers of CONcreTEXT.

In [None]:
stimuli = pd.read_csv('./stimuli/CONcreTEXT_trial_IT.tsv', sep='\t')

stimuli_y = stimuli['MEAN']
stimuli_X = stimuli.drop(['MEAN'], axis=1)

Finally, test the selected predictors through 5-fold cross-validation. Model performance is measured via the Pearson and Spearman correlations. 

If you decide to use the English translation of the Italian stimuli, please keep in mind that the translation process can be rather slow, such that each fold of the cross-validation is likely to take around 5-10 minutes.     

In [None]:
n_splits = 5

kf = KFold(n_splits, shuffle=True)

res_pearson = []
res_spearman = []

for train_index, test_index in kf.split(stimuli_X):
    
    # KFold yields positional indices, so select rows by position for both X and y
    X_train, X_test = stimuli_X.iloc[train_index, :], stimuli_X.iloc[test_index, :]
    y_train, y_test = stimuli_y.iloc[train_index], stimuli_y.iloc[test_index]

    pred_list = curr_model.fit(X_train, y_train)  
    pearson_corr, spearman_corr = curr_model.score(X_test, y_test)
    
    print('Pearson correlation: {:.2f}'.format(pearson_corr))
    print('Spearman correlation: {:.2f}'.format(spearman_corr))
    print('\n\n')
    
    res_pearson.append(pearson_corr)
    res_spearman.append(spearman_corr) 
        
print('Mean correlation (Pearson): {:.2f}'.format(np.mean(res_pearson)))
print('Mean correlation (Spearman): {:.2f}'.format(np.mean(res_spearman)))
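
For reference, the two metrics can be computed directly with SciPy. The sketch below uses made-up values and is only one common way to obtain these correlations; it is not necessarily how the score() method computes them internally.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Made-up ratings and predictions; the last prediction is a deliberate outlier
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Each function returns the correlation coefficient followed by a p-value
pearson_corr, _ = pearsonr(y_true, y_pred)
spearman_corr, _ = spearmanr(y_true, y_pred)

print('Pearson correlation: {:.2f}'.format(pearson_corr))
print('Spearman correlation: {:.2f}'.format(spearman_corr))
```

Because the predictions preserve the rank order of the ratings, the Spearman correlation is perfect, whereas the Pearson correlation is pulled down by the outlier. Reporting both gives a view of linear agreement as well as rank agreement.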