<a href="https://colab.research.google.com/github/ccaballeroh/Translator-Attribution/blob/master/03Most_important_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extraction of Most Relevant Features

In this notebook, we extract the features that are most relevant to the classification of each translator. To do this, we retrieve the learned weights from a linear classifier and take the $n$ largest. Logistic Regression is one such classifier; a Support Vector Machine with a linear *kernel* and the Naïve Bayes classifier expose comparable per-feature weights. The corresponding $n$ features are thus the most relevant for each class. For a binary classifier, the $n$ largest weights correspond to the *positive* class, whereas the $n$ most negative weights correspond to the *negative* class.
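
As a rough illustration of this idea, the sketch below picks the $n$ largest weights per class from a weight matrix shaped like scikit-learn's `coef_` attribute (one row per class, or a single row in the binary case). The function name `top_n_features`, the feature names, the weights, and the translator labels are all made up for the example; this is not the code in `helper.features`.

```python
import numpy as np

def top_n_features(weights, feature_names, class_names, n=2):
    """Map each class to its n most relevant (feature, weight) pairs."""
    weights = np.atleast_2d(np.asarray(weights, dtype=float))
    names = np.asarray(feature_names)

    def pairs(row, idx):
        return [(str(names[i]), float(row[i])) for i in idx]

    if weights.shape[0] == 1:
        # Binary case: a single weight vector. The largest weights point to
        # the positive class (second label), the most negative ones to the
        # negative class (first label).
        row = weights[0]
        ascending = np.argsort(row)
        negative, positive = class_names
        return {
            negative: pairs(row, ascending[:n]),
            positive: pairs(row, ascending[::-1][:n]),
        }

    # Multiclass case: one weight vector per class, take the n largest of each.
    return {
        label: pairs(row, np.argsort(row)[::-1][:n])
        for label, row in zip(class_names, weights)
    }

# Hypothetical features and weights, for illustration only.
feature_names = ["the", "of", "whilst", "upon", "shall"]
multiclass_weights = [
    [0.1, -0.4, 1.2, 0.7, -0.9],
    [-0.3, 0.8, -1.1, 0.2, 0.5],
    [0.2, -0.4, -0.1, -0.9, 0.4],
]
multi = top_n_features(
    multiclass_weights, feature_names,
    ["translator_A", "translator_B", "translator_C"],
)
binary = top_n_features(
    [[0.1, -0.4, 1.2, 0.7, -0.9]], feature_names,
    ["translator_A", "translator_B"],
)
print(multi)
print(binary)
```

With a fitted scikit-learn model, `weights` would come from `model.coef_` and `class_names` from the label encoder's `classes_`, under the assumption that the rows of `coef_` follow the same class order.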

For an $N$-class problem, scikit-learn's linear classifiers learn one weight vector per class (in the one-vs-rest scheme, $N$ binary classifiers), so we can retrieve the $n$ largest weights, and their corresponding features, for each class. This notebook saves to disk the $n$ most relevant features for each translator in the corpora, for each feature set and for three classifiers: Logistic Regression, linear Support Vector Machine, and Naïve Bayes. The results are saved as bar plots and as tables (CSV, HTML, and $\LaTeX$) in the `results\figs\most` and `results\tables` folders, respectively.

This notebook also contains code that generates the confusion matrices obtained by training a Logistic Regression classifier on the *entire* dataset. We train on the entire dataset because 10-fold cross-validation has already shown that the accuracy of the classifier is high enough. The confusion matrices are generated for each feature set and are saved to disk in the `results\figs\cm` folder.
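
For readers unfamiliar with confusion matrices, here is a minimal sketch of how one is tallied from true and predicted labels. It uses the convention, shared with scikit-learn, that rows are true classes and columns are predicted classes. The labels and predictions are invented for the example, and `confusion_matrix_counts` is a hypothetical helper, not the repository's `plot_confusion_matrix`.

```python
import numpy as np

def confusion_matrix_counts(y_true, y_pred, labels):
    """Count how often each true label (row) was predicted as each label (column)."""
    index = {label: i for i, label in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for true, pred in zip(y_true, y_pred):
        cm[index[true], index[pred]] += 1
    return cm

# Invented labels and predictions, for illustration only.
labels = ["translator_A", "translator_B"]
y_true = ["translator_A", "translator_A", "translator_A", "translator_B", "translator_B"]
y_pred = ["translator_A", "translator_A", "translator_B", "translator_B", "translator_B"]

cm = confusion_matrix_counts(y_true, y_pred, labels)
# The diagonal holds the correct predictions, so accuracy is trace / total.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)
```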


**NOTE:** This notebook can be run on Google Colab after having followed the instructions found in the [README](./README.md) file in this repository.

In [None]:
from pathlib import Path
import sys

IN_COLAB = "google.colab" in sys.modules

In [None]:
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive/')
    ROOT = Path(r"./drive/My Drive/Translator-Attribution")
    sys.path.insert(0,f"{ROOT}/")
    import warnings
    warnings.filterwarnings("ignore")
else:
    from helper.analysis import ROOT

In [None]:
from helper.features import convert_data, plot_most_relevant, plot_confusion_matrix, train_extract_most_relevant, save_tables

These are the files to process. They comprise all the feature sets produced by [01Processing](./01Processing.ipynb).

In [None]:
from helper.analysis import JSON_FOLDER
features_files = [file for file in JSON_FOLDER.iterdir() if file.name.startswith("features")]

## Most Relevant Features

The next cells generate and save the bar plots and tabular data of the $n=15$ most relevant features in the classification process for each translator and each feature set, using three classifiers: Logistic Regression, Linear Support Vector Machine, and Naïve Bayes.

To apply feature selection using the $\chi^2$ statistic, leave `feature_selection` set to `True` in the code cell below. Otherwise, change it to `False`.
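
For intuition, the $\chi^2$ score of a non-negative (e.g., count) feature compares the per-class totals actually observed with the totals expected if the feature were independent of the class. The sketch below computes that score with NumPy on a toy matrix. It is an illustration under that assumption, not necessarily what `train_extract_most_relevant` does internally; scikit-learn offers `sklearn.feature_selection.chi2` and `SelectKBest` for the same purpose. The data and the name `chi2_scores` are made up.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared score of each non-negative feature column against the labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot class membership, shape (n_samples, n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                             # per-class feature totals
    feature_total = X.sum(axis=0, keepdims=True)   # shape (1, n_features)
    class_prob = Y.mean(axis=0, keepdims=True)     # shape (1, n_classes)
    expected = class_prob.T @ feature_total        # totals under independence
    # Assumes every feature occurs at least once, so expected is never zero.
    return ((observed - expected) ** 2 / expected).sum(axis=0)

# Toy data: feature 0 appears only in class "A"; feature 1 is evenly spread.
X = [[3, 1], [3, 1], [0, 1], [0, 1]]
y = ["A", "A", "B", "B"]

scores = chi2_scores(X, y)
# Keep the k highest-scoring features (here k = 1).
selected = np.argsort(scores)[::-1][:1]
print(scores)
print(selected)
```

A feature that is spread evenly across classes scores zero and would be dropped first, whereas one concentrated in a single class scores high and is kept.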



In [None]:
feature_selection = True

In [None]:
for model_name in ["LogisticRegression", "SVM", "NaiveBayes"]:
    for author in ["Ibsen", "Quixote"]:
        # Process every feature file that belongs to this author's corpus
        for file in [file for file in features_files if author in file.name]:
            data = convert_data(file=file)
            args = {
                "model_name": model_name,
                "X": data["X"],
                "y": data["y"],
                "encoder": data["encoder"],
                "dict_vectorizer": data["dict_vectorizer"],
                "feature_selection": feature_selection,
            }
            exp_results = train_extract_most_relevant(**args)
            most_relevant = exp_results["most_relevant"]

            # Plot and tabulate the most relevant features of each translator
            for translator in data["encoder"].classes_:
                plot_most_relevant(data=most_relevant, translator=translator, model=model_name, file=file)
                df = most_relevant[translator]
                save_tables(df=df, translator=translator, file=file, model_name=model_name)

## Confusion Matrices

The following code generates the confusion matrices for all the feature sets using a Logistic Regression classifier.

In [None]:
for author in ["Ibsen", "Quixote"]:
    for file in [file for file in features_files if author in file.name]:
        data = convert_data(file=file)
        X = data["X"]
        y = data["y"]
        encoder = data["encoder"]
        plot_confusion_matrix(X=X, y=y, encoder=encoder, file=file)