<a href="https://colab.research.google.com/github/ccaballeroh/Translator-Attribution/blob/master/03Most_important_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Extraction of Most Relevant Features

In this notebook, we extract the most relevant features in the classification process for each translator. To do this, we retrieve the learned weights from a linear classifier (e.g., Logistic Regression, although a Support Vector Machine with a linear *kernel* and the Naïve Bayes classifier also expose such weights) and take the $n$ largest. The corresponding $n$ features are thus the most relevant for each class. In the case of a binary classifier, the $n$ largest weights correspond to the *positive* class, whereas the $n$ most negative weights correspond to the *negative* class.
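
As a rough illustration of the binary case, the sketch below sorts a weight vector and reads off both ends. The weights, the feature names, and the helper `top_n_binary` are made up for this example and are not part of the repository.

```python
import numpy as np


def top_n_binary(weights, feature_names, n):
    """Return (features for the positive class, features for the negative class)."""
    weights = np.asarray(weights, dtype=float)
    names = np.asarray(feature_names)
    order = np.argsort(weights)                  # indices sorted by ascending weight
    negative = names[order[:n]].tolist()         # most negative weights first
    positive = names[order[::-1][:n]].tolist()   # largest weights first
    return positive, negative


# Hypothetical weights of a binary linear classifier, one per feature
weights = [0.8, -1.2, 0.1, 2.0, -0.3]
feature_names = ["the", "of", "and", "whilst", "upon"]

positive, negative = top_n_binary(weights, feature_names, n=2)
print(positive)  # features that push towards the positive class
print(negative)  # features that push towards the negative class
```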

Since scikit-learn's linear classifiers learn one weight vector per class when given an $N$-class multiclass problem (for instance, by training $N$ one-vs-rest binary classifiers), we can retrieve the $n$ largest weights, and their corresponding features, for each class. This notebook saves to disk the $n$ most relevant features for each translator in the corpora, for each feature set and for three classifiers: Logistic Regression, linear Support Vector Machine, and Naïve Bayes. The results are saved as bar plots and as tables (CSV, HTML, and $\LaTeX$) in the `results\figs\most` and `results\tables` folders, respectively.

This notebook also contains code for generating the confusion matrices of a Logistic Regression classifier over the *entire* dataset. The accuracy of the classifier has already been shown to be high enough via 10-fold cross-validation; here, the predictions for the matrices are likewise obtained with 10-fold cross-validation (`cross_val_predict`), so every sample is predicted by a model that was not trained on it. The confusion matrices are generated for each feature set and are also saved to disk in the `results\figs\cm` folder.


**NOTE:** This notebook can be run on Google Colab after having followed the instructions found in the [README](./README.md) file in this repository.

In [0]:
from pathlib import Path
import sys

IN_COLAB = "google.colab" in sys.modules

In [0]:
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive/')
    ROOT = Path(r"./drive/My Drive/Translator-Attribution")
    sys.path.insert(0,f"{ROOT}/")
    import warnings
    warnings.filterwarnings("ignore")
else:
    from helper.analysis import ROOT

In [0]:
from collections import defaultdict
from helper.analysis import get_dataset_from_json
from helper.analysis import JSON_FOLDER
from helper.utils import return_n_most_important
from pathlib import Path
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import normalize
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.utils import shuffle
from sklearn.utils.multiclass import unique_labels
from typing import Dict, List
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

The following cell sets the locations where the tables and figures are saved. It creates the folders if they don't exist yet.

In [0]:
RESULTS_FOLDER = Path(fr"{ROOT}/results/")
TABLES_FOLDER = RESULTS_FOLDER / "tables"
FIGS_FOLDER = RESULTS_FOLDER / "figs"
CONF_MAT_FOLDER = FIGS_FOLDER / "cm"
MOST_RELEVANT_FOLDER = FIGS_FOLDER / "most"

# Create each folder, together with any missing parent, if it doesn't exist yet
for folder in (TABLES_FOLDER, CONF_MAT_FOLDER, MOST_RELEVANT_FOLDER):
    folder.mkdir(parents=True, exist_ok=True)

These are the files to process. They are the entirety of the feature sets obtained using [01Processing](./01Processing.ipynb).
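
The later cells rely on the file names in two ways: they filter by author name, and they build plot titles from the part of the stem after the first underscore. The sketch below illustrates both with invented file names; the real names are whatever the processing notebook wrote to `JSON_FOLDER`.

```python
from pathlib import Path

# Hypothetical file names, used only to illustrate the naming convention
candidates = [
    Path("features_Quixote_punctuation.json"),
    Path("features_Ibsen_word_ngrams.json"),
    Path("metadata.json"),
]

# Keep only the feature files, then those belonging to one author
features_files = [f for f in candidates if f.name.startswith("features")]
quixote_files = [f for f in features_files if "Quixote" in f.name]

# Title used in the plots: everything after the leading "features" token
title = " ".join(quixote_files[0].stem.split("_")[1:])
print(title)
```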

In [0]:

features_files = [file for file in JSON_FOLDER.iterdir() if file.name.startswith("features")]

## Most Relevant Features

The next cells define a couple of functions to generate and save the bar plots and tabular data of the $n=15$ most relevant features in the classification process for each translator and each feature set using three classifiers: Logistic Regression, Linear Support Vector Machine, and Naïve Bayes.
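
For orientation, the functions below appear to expect a dictionary that maps each translator to a DataFrame with `Weight` and `Feature` columns. That structure is produced by `return_n_most_important`, whose implementation is not shown in this notebook, so the following is only a sketch with invented translator names and values. It also shows how `float_format` rounds the weights on export.

```python
import pandas as pd

# Hypothetical data: translator names, weights, and features are invented
most_relevant = {
    "TranslatorA": pd.DataFrame(
        {"Weight": [1.23456, 0.98761], "Feature": ["whilst", "upon"]}
    ),
    "TranslatorB": pd.DataFrame(
        {"Weight": [2.5, 0.125], "Feature": ["thus", "of"]}
    ),
}

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = most_relevant["TranslatorA"].to_csv(float_format="%.4f")
print(csv_text)
```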

To perform feature selection using the $\chi^2$ statistic, leave the flag in the following cell set to `True`. Otherwise, change it to `False`.



In [0]:
FEATURE_SELECTION = True

In [0]:


sns.set_style("whitegrid")

def plot_most_relevant(
    *, data: Dict[str, pd.DataFrame], translator: str, model: str, file: Path
) -> None:
    """Saves a bar plot of the most relevant features for a translator using a classifier.

    The function takes a dictionary of data frames, keyed by translator name, with the
    n most relevant weights and features for a classifier.

    Parameters:
    data: Dict[str, pd.DataFrame]  - The key is the translator name and the DataFrame
                                 contains two Series: 'Weight' and 'Feature'
    translator: str             - Name of the translator
    model: str                  - Name of the model (classifier) used
    file: Path                  - Feature set used to train the model

    Returns:
    None
    """
    plot = sns.barplot(
        x=data[translator]["Weight"], y=data[translator]["Feature"], palette="cividis",
    )
    features = " ".join(file.stem.split("_")[1:])
    plot.set(title=f"{translator} - {model} - {features}")
    fig = plot.get_figure()
    fig.savefig(
        MOST_RELEVANT_FOLDER / f"{file.stem}_{translator}_{model}.png", bbox_inches="tight",
    )
    fig.clf()

def save_tables(*, df:pd.DataFrame, translator:str, file:Path, model_name:str)-> None:
    """Saves to disk the tabular data of the n most relevant features of a classifier.

    Takes a DataFrame containing the n most relevant features and their weights.

    Parameters:
    df: pd.DataFrame        - Contains two series: 'Weight' and 'Feature'
    translator: str         - Name of the translator
    file: Path              - Feature set used to train the classifier
    model_name: str         - Name of the classifier used

    Returns:
    None
    """
    df.to_csv(TABLES_FOLDER / f"{file.stem}_{translator}_{model_name}.csv", float_format='%.4f')

    latex = df.to_latex(float_format=lambda x: '%.4f' % x)
    with open(TABLES_FOLDER /f"{file.stem}_{translator}_{model_name}.tex", "w") as f:
        f.write(latex)
    
    html = df.to_html(float_format='%.4f')
    with open(TABLES_FOLDER /f"{file.stem}_{translator}_{model_name}.html", "w") as f:
        f.write(html)

In [0]:
for model_name in ["LogisticRegression", "SVM", "NaiveBayes"]:
    for author in ["Ibsen", "Quixote"]:
        for file in [file for file in features_files if author in file.name]:
            X_dict, y_str = get_dataset_from_json(file)
            
            v = DictVectorizer(sparse=True)
            encoder = LabelEncoder()         
            
            X, y = v.fit_transform(X_dict), encoder.fit_transform(y_str)

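            # DictVectorizer.get_feature_names() was removed in scikit-learn 1.2;
            # get_feature_names_out() (available since 1.0) replaces it.
            all_names = np.asarray(v.get_feature_names_out())
            if FEATURE_SELECTION:
                # Keep the 50 features with the highest chi-squared score
                chi2_selector = SelectKBest(chi2, k=50)
                X = chi2_selector.fit_transform(X, y)
                feature_names = list(all_names[chi2_selector.get_support()])
            else:
                feature_names = list(all_names)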


            X_, y_ = shuffle(X, y, random_state=24)

            if model_name == "LogisticRegression":
                model = LogisticRegression()
            elif model_name == "SVM":
                model = LinearSVC()
            elif model_name == "NaiveBayes":
                model = MultinomialNB()
            else:
                raise NotImplementedError

            clf = model.fit(X_, y_)

            most_relevant = return_n_most_important(
                                                    clf=clf,
                                                    feature_names=feature_names,
                                                    encoder=encoder,
                                                    n=15
                            )

            for translator in encoder.classes_:
                plot_most_relevant(data=most_relevant, translator=translator, model=model_name, file=file)
                df = most_relevant[translator]
                save_tables(df=df, translator=translator, file=file, model_name=model_name)
            

## Confusion Matrices

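The following code generates the confusion matrices for all the feature sets using a Logistic Regression classifier. The predictions are obtained with `cross_val_predict` using 10 folds, so each sample is predicted by a model that was not trained on it.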
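
A minimal sketch of the same idea on synthetic data, assuming two well-separated classes and 5 folds instead of 10 to keep the example small. Row $i$, column $j$ of the matrix counts the samples of true class $i$ that were predicted as class $j$, so every row sums to the number of samples in that class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic data: 10 samples per class, drawn around two distant centres
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(10, 2)),
    rng.normal(10.0, 1.0, size=(10, 2)),
])
y = np.array([0] * 10 + [1] * 10)

# Each prediction comes from a model fit on the other folds
y_pred = cross_val_predict(LogisticRegression(), X, y, cv=5)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y, y_pred, labels=[0, 1])
print(cm)
```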

In [0]:
sns.set(font_scale=1.4)
for author in ["Ibsen", "Quixote"]:
    for file in [file for file in features_files if author in file.name]:
        X_dict, y_str = get_dataset_from_json(file)
        v = DictVectorizer(sparse=True)
        encoder = LabelEncoder()
        
        X, y = v.fit_transform(X_dict), encoder.fit_transform(y_str)
        
        if FEATURE_SELECTION:            
            chi2_selector = SelectKBest(chi2, k=50)
            X = chi2_selector.fit_transform(X, y)

        X_, y_ = shuffle(X, y, random_state=24)    
        
        log_model = LogisticRegression()

        y_pred = cross_val_predict(log_model, X_, y_, cv=10)
        cm = confusion_matrix(y_, y_pred, labels=unique_labels(y_))

        df = pd.DataFrame(cm, index=encoder.classes_, columns=encoder.classes_)

        cm_plot = sns.heatmap(df, annot=True, cbar=False, cmap="Blues", fmt="d", annot_kws={"size":18})
        plt.title(f"{' '.join(file.stem.split('_')[1:])}")
        plt.tight_layout()
        plt.ylabel("True translator")
        plt.xlabel("Predicted translator")
        plt.savefig(CONF_MAT_FOLDER/f"cm_{file.stem}.png", bbox_inches="tight", )
        plt.clf()