<a href="https://colab.research.google.com/github/ccaballeroh/MCPR-2021/blob/main/04Extraction_Most_Relevant_Features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from pathlib import Path
import sys

IN_COLAB = "google.colab" in sys.modules

In [None]:
import warnings
warnings.filterwarnings("ignore")
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive/')
    ROOT = Path(r"./drive/My Drive/Translator-Attribution")
    sys.path.insert(0,f"{ROOT}/")
else:
    from helper.analysis import ROOT

# Extraction of Most Relevant Features

In this notebook, we extract the features that are most relevant to the classification of each translator. To do this, we retrieve the learned weights from a linear classifier and take the $n$ largest; the corresponding $n$ features are the most relevant for each class. We use Logistic Regression, although a Support Vector Machine with a linear *kernel* and the Naïve Bayes classifier also expose per-feature weights. For a binary classifier, the $n$ largest weights correspond to the *positive* class, whereas the $n$ most negative weights correspond to the *negative* class.
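As a minimal sketch of this idea for the binary case, the snippet below fits scikit-learn's `LogisticRegression` on a small made-up count matrix and ranks the entries of `coef_`. The feature names, the counts, and the `rank_features` helper are illustrative assumptions; they are not part of this repository's `helper` package.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up counts for four hypothetical features (illustration only).
names = np.array(["alpha", "beta", "gamma", "delta"])
X = np.array([
    [5, 0, 2, 1], [4, 1, 2, 0], [6, 0, 2, 1], [5, 1, 2, 0],  # class 0
    [0, 5, 2, 1], [1, 4, 2, 0], [0, 6, 2, 1], [1, 5, 2, 0],  # class 1
], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
# Binary case: a single weight vector, signed towards clf.classes_[1].
weights = clf.coef_[0]


def rank_features(weights, names, n, largest=True):
    """Return the n largest weights (descending) or the n smallest (ascending)."""
    order = np.argsort(weights)  # ascending order of weights
    if largest:
        order = order[::-1]
    return [(str(names[i]), float(weights[i])) for i in order[:n]]


positive = rank_features(weights, names, n=1)                 # points to class 1
negative = rank_features(weights, names, n=1, largest=False)  # points to class 0
print(positive, negative)
```

Here the largest weight belongs to the feature that is frequent only in class 1, and the most negative weight to the feature that is frequent only in class 0.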

Since scikit-learn learns one weight vector per class for an $N$-class problem (under a one-vs-rest scheme, these are $N$ binary classifiers), we can retrieve the $n$ largest weights, and their corresponding features, for each class. This notebook trains a logistic regression classifier and saves to disk the $n$ most relevant features for each translator in the corpora and for each feature set. The results are saved as bar plots and as $\LaTeX$ tables in the `results\figs\most` and `results\tables` folders, respectively.
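To make the per-class view concrete, here is a hedged sketch that wraps `LogisticRegression` in scikit-learn's `OneVsRestClassifier`, so that the $N$ binary classifiers are explicit, and collects the top $n$ features for each class. The translator labels `T1` to `T3`, the feature names, and the counts are invented for illustration; the repository's own `train_extract_most_relevant` may be implemented differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up counts: each hypothetical translator favours one feature.
names = np.array(["f0", "f1", "f2"])
X = np.array([
    [5, 0, 0], [4, 1, 0], [5, 0, 1], [6, 0, 0],  # T1
    [0, 5, 0], [0, 4, 1], [1, 5, 0], [0, 6, 0],  # T2
    [0, 0, 5], [1, 0, 4], [0, 1, 5], [0, 0, 6],  # T3
], dtype=float)
y = np.array(["T1"] * 4 + ["T2"] * 4 + ["T3"] * 4)

# One binary LogisticRegression is fitted per class.
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

n = 2
top_per_class = {}
for label, estimator in zip(ovr.classes_, ovr.estimators_):
    w = estimator.coef_[0]             # weights of "this class vs. the rest"
    order = np.argsort(w)[::-1][:n]    # indices of the n largest weights
    top_per_class[str(label)] = [(str(names[i]), float(w[i])) for i in order]

for label, ranked in top_per_class.items():
    print(label, ranked)
```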

In [None]:
from helper.features import convert_data, plot_most_relevant, train_extract_most_relevant, save_tables

The files to process are all of the feature sets obtained with [01Processing](./01Processing.ipynb).

In [None]:
from helper.analysis import JSON_FOLDER

## Most Relevant Features



We use the $\chi^2$ statistic to select the $k$ most distinctive features in each feature set and train a Logistic Regression classifier on them. Then we extract the $n$ most important features for each class (i.e., translator).
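These two steps can be sketched with scikit-learn's `SelectKBest`, `chi2`, and `LogisticRegression` chained in a `Pipeline`. This is only an illustration on assumed toy data: the feature names, the counts, and the step names `select` and `clf` are hypothetical, and the internals of `train_extract_most_relevant` are not shown in this notebook. `get_support()` maps the classifier's weights back to the names of the features that survived selection.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up non-negative counts (chi2 requires non-negative features).
names = np.array(["alpha", "beta", "gamma", "delta"])
X = np.array([
    [5, 0, 2, 1], [4, 1, 2, 0], [6, 0, 2, 1], [5, 1, 2, 0],  # class 0
    [0, 5, 2, 1], [1, 4, 2, 0], [0, 6, 2, 1], [1, 5, 2, 0],  # class 1
], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

k, n = 2, 1  # keep k features, report the n largest weights

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=k)),  # keep the k highest chi2 scores
    ("clf", LogisticRegression()),       # linear classifier on the kept features
]).fit(X, y)

# Names of the features that passed the selection, in column order.
kept = names[pipe.named_steps["select"].get_support()]
# One weight per kept feature (binary case: signed towards class 1).
weights = pipe.named_steps["clf"].coef_[0]

order = np.argsort(weights)[::-1][:n]
top = [(str(kept[i]), float(weights[i])) for i in order]
print(list(kept), top)
```

The two features whose counts are identical across classes receive a $\chi^2$ score of zero and are dropped before the classifier is trained.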

In [None]:
most_relevant = {}
num_of_features = 25  # k: features kept by the chi-squared selection
n_most_relevant = 10  # n: top features reported per translator
args = {"k": num_of_features, "n": n_most_relevant}

for author in ["Ibsen", "Quixote"]:
    # All feature-set files whose name starts with this prefix
    features_files = [file for file in JSON_FOLDER.iterdir() if file.name.startswith(author)]
    for file in features_files:
        data = convert_data(file=file)
        data = {**data, **args}
        most_relevant[file.stem] = train_extract_most_relevant(**data)

        # Save one bar plot and one table per translator
        for translator in data["encoder"].classes_:
            plot_most_relevant(data=most_relevant[file.stem], translator=translator, file=file)
            df = most_relevant[file.stem][translator]
            save_tables(df=df, translator=translator, file=file)


## Results

As an example, we can compare bigrams among the parallel translations of Ibsen (i.e., *Ghosts*).

In [None]:
key = "_".join(
    ["Ibsen",
     "Ghosts",
     "2grams",
    ])

In [None]:
for translator in most_relevant[key].keys():
    print(translator, ":\n", most_relevant[key][translator], 2*"\n")

Sharp :
         Feature    Weight
1       was the  0.615094
2         as if  0.568035
3         up to  0.450067
4        do n't  0.439230
5   manders and  0.429741
6   manders but  0.399228
7         it is  0.339203
8        at all  0.292298
9          i am  0.077720
10     going to  0.037981 


Archer :
        Feature    Weight
1         i 'm  1.091468
2        i 've  0.913881
3       do not  0.747682
4    PROPN why  0.698373
5      can not  0.638415
6   PROPN then  0.623570
7    not PROPN  0.501947
8    very well  0.425225
9     there 's  0.407276
10   to morrow  0.362066 




Another example comes from the other corpus. We show the 10 most distinctive cohesive markers, surrounded by their corresponding punctuation marks, for each translator.

In [None]:
key = "_".join(
    ["Quixote",
     "cohesive",
     "punct"       
    ])
for translator in most_relevant[key].keys():
    print(translator, ":\n", most_relevant[key][translator], 2*"\n")

Jarvis :
         Feature    Weight
1   . in short,  2.111955
2        ; and,  1.072222
3   immediately  0.429081
4       , since  0.389144
5          also  0.353554
6         ; and  0.335309
7         : and  0.330906
8         . and  0.326769
9         " and  0.209485
10      , while  0.192520 


Ormsby :
        Feature    Weight
1   , however,  0.913814
2      , while  0.624157
3         "and  0.397329
4         "but  0.335526
5        " and  0.280612
6          and  0.237583
7       'there  0.033889
8     although  0.001344
9        , and -0.084581
10  , although -0.167373 


Shelton :
        Feature    Weight
1         'and  1.651980
2        , yet  1.586467
3         'but  0.972124
4   , although  0.819939
5     likewise  0.737546
6          yet  0.640972
7       'there  0.581394
8      , since  0.428931
9    therefore  0.334524
10      , and,  0.325530 


