# RAMP starting kit: Who Wrote This?

_Authors: Romain AVOUAC, Jaime COSTA, Adrien DANEL, Guillaume DESFORGES, José-Louis IMBERT, Slimane THABET_

**TODO**: add the word cloud illustration

## Table of Contents

0. [Introduction](#Introduction)
1. [Data](#Data)
2. [Score metric](#Score-metric)
3. [Library requirements](#Library-requirements)
4. [Exploratory analysis](#Exploratory-analysis)
    * [Basic text preprocessing](#Basic-text-preprocessing)
5. [Predictions](#Predictions)
6. [Ramp workflow](#Ramp-workflow)
7. [More information](#More-information)

## Introduction

Authorship identification is the task of recognizing who the author of a document is.
It belongs to the family of Natural Language Processing (NLP) tasks.

Being able to identify the author of a document has several applications, such as detecting plagiarism or attributing an anonymous document.
Archives all around the world are full of documents whose authorship, if known, would be invaluable for historical studies.
Furthermore, an algorithm could help investigate [the multiple plagiarism scandals](https://lithub.com/12-literary-plagiarism-scandals-ranked/) in literature.
For instance, the authorship of the works of Molière and Shakespeare has been debated since the 19th century (more on this [here](https://fr.wikipedia.org/wiki/Paternit%C3%A9_des_%C5%93uvres_de_Moli%C3%A8re) and [here](https://fr.wikipedia.org/wiki/Paternit%C3%A9_des_%C5%93uvres_de_Shakespeare); both pages are in French).

This task also has an instructive purpose.
It is a way to investigate whether NLP algorithms are able to capture not only the semantics but also the literary style of a document.

## Data

We will limit ourselves to a selection of French novelists from the 19th century.

**TODO**:
* why the 19th century?
* which authors? why?
* which texts? why?

**TODO** list
* the files
* what they are used for
* their columns and what they mean

## Score metric

We want to predict the author of a given piece of literature from a given set of authors.
In machine learning, this type of problem is called a "multiclass classification" problem: for each item, we predict the class it belongs to.
Here, the items are the documents (one or more paragraphs) and the classes are the authors.

To evaluate the performance of an algorithm on this type of problem, one could use its *precision*.
The precision of an algorithm on a given class is the number of correct predictions of that class divided by the total number of items for which the algorithm predicted that class.
We could then compute, for instance, the mean precision of the algorithm over all classes.
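The definition above can be sketched in a few lines of plain Python. The labels below are hypothetical toy values, not taken from the challenge data, and `per_class_precision` is a helper name introduced only for this illustration.

```python
from collections import Counter

def per_class_precision(y_true, y_pred):
    """Precision per class: correct predictions of a class / all predictions of that class.

    Classes that are never predicted are left out, since their precision is undefined.
    """
    predicted = Counter(y_pred)  # how many times each class was predicted
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)  # right predictions per class
    return {label: correct[label] / count for label, count in predicted.items()}

# Hypothetical toy labels
y_true = ["Hugo", "Hugo", "Zola", "Zola", "Verne", "Verne"]
y_pred = ["Hugo", "Zola", "Zola", "Zola", "Verne", "Hugo"]

precisions = per_class_precision(y_true, y_pred)
mean_precision = sum(precisions.values()) / len(precisions)
print(precisions)
print(mean_precision)
```

Here "Zola" is predicted three times and is correct twice, so its precision is 2/3, while "Verne" is predicted only once, correctly, and reaches a precision of 1 even though half of the Verne documents were missed.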

On the other hand, one could argue that the *recall* of the algorithm (for a given class, the fraction of items of that class that the algorithm correctly identifies) is also important, or its *accuracy* (the fraction of all items that are classified correctly).

Most of the time, an algorithm can be tuned to offer better precision or better recall, but not both at the same time: there is no free lunch.
There is a trade-off to be made, usually depending on the application domain.
For example, in medical diagnosis you would want as few false negatives as possible, which means favoring recall.

In order to evaluate the model, we propose to use the F1 metric.

**TODO** continue by presenting the F1 score.
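As a placeholder until the presentation is written, here is a minimal sketch of the usual definition of the F1 score as the harmonic mean of precision and recall. The function name `f1_from` and the numeric values are hypothetical and only serve the illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (taken to be 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two hypothetical classifiers with the same arithmetic mean (0.5) of precision and recall
balanced = f1_from(0.5, 0.5)  # 0.5
lopsided = f1_from(0.9, 0.1)  # about 0.18
print(balanced, lopsided)
```

Unlike the arithmetic mean, the harmonic mean is pulled towards the smaller of the two values, so a model cannot obtain a good F1 score by trading almost all of its recall for precision, or the other way round.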

## Library requirements

To run this starting kit, the following libraries are required:
- `numpy`
- `pandas`
- `nltk`
- `plotly`
- `plotly_express`
- `matplotlib`
- `seaborn`
- `scikit-learn`
- `gensim`

They can be installed all at once using the `requirements.txt` file with pip:

In [6]:
# !pip install -r requirements.txt

In order to make submissions to the challenge, the `ramp-workflow` library is also needed. It can be installed from GitHub using pip:

In [4]:
# !pip install git+https://github.com/paris-saclay-cds/ramp-workflow

## Exploratory analysis

### Download and load data

In [None]:
import pandas as pd
df = pd.read_csv("")  # the path to the data file is not filled in yet

### Basic text preprocessing

NLP data is special in the sense that it is unstructured.
Structured data are tables where each item is a set of key-value pairs, each pair describing a feature of the item.
In this challenge, each item is a document written in natural language.
Most algorithms cannot process raw text given as a sequence of characters, and it is part of the job of a data scientist to design a suitable data processing pipeline.

Usually, it starts with a tokenization step where the document is cut into pieces, such as words.
Transforming the data from a sequence of characters into a sequence of words then makes it easier to engineer actual features for each document.

Below is a simple tokenization:

**TODO** add the tokenization

**TODO** write down a few limitations and suggestions for improvement
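Pending the actual tokenization cell, the step described above could look like the following regex-based sketch. It is an illustrative assumption, not the tokenizer used later in this notebook (which relies on `nltk`), and `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names.

```python
import re

# \w+ matches runs of letters, digits and underscores (Unicode-aware in Python 3,
# so accented French letters are kept); [^\w\s] matches one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tokenize("Bonjour, le monde !"))
print(simple_tokenize("C'était"))
```

One visible limitation of this sketch is that French elisions and hyphenated forms are split apart: `"C'était"` becomes three tokens (`c`, `'`, `était`), and an expression such as `dit-il` would be split around the hyphen.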

### Basic features and statistics

In [None]:
# TODO adrien

### Extracting meaning from words with the LDA

LDA stands for Latent Dirichlet Allocation.
It is a simple yet powerful topic model widely used in NLP: it describes each document as a mixture of topics and each topic as a distribution over words.

**TODO** continue the presentation

In [None]:
from nltk import download

download('punkt', quiet=True)
download("stopwords")
_punctuation = '.?!:;&()`"\'@°_-~'
CUSTOM_STOPWORDS = ["--", ".", ",", "!", ";", "’", ":", "?", "...", "'", "«", "»", '(', ')', '[', ']']
other_stopwords = ['comme', 'elles', "c'était", "qu'il", "qu'elle", 'où', 'car', 'sans', 'vers', 'encore', 'cette', 
                  'a', 'faire', 'fait', 'fais', 'à', 'donc', 'tout', 'cet', 'là', 'ceux', 'leur', 'leurs', 'parmi', 
                  'puis', 'ensuite', 'alors', "qu'ils", "qu'elles", "m'en", "j'en", 'dit-il', 'dit-elle', 'répondit',
                  "s'ils", 'vont', "s'il", "n'est", 'pourquoi', "lorsqu'il", "lorsqu'elle", "presque", 'lorsque', 
                  "contre", 'toujours', 'plus', 'dès', 'autre', 'tous', 'tout', 'si', "j'ai", "tous", 'tout', 'toutes',
                  'pourtant', "c'est", "cela", "être", "jamais", "s'était", "l'avait"]

In [None]:
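def make_stopwords_remover(stopwords):
    """Return a function that drops every token found in `stopwords`."""
    def stopwords_remover(words):
        return [word for word in words if word not in stopwords]

    return stopwords_remover

def flatten_count(accumulator, items):
    """Add the occurrences of each item to the `accumulator` dict and return it."""
    for item in items:
        accumulator[item] = accumulator.get(item, 0) + 1
    return accumulator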
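The helper `flatten_count` is not called anywhere else in this notebook. One plausible use, shown here only as an assumption, is as a reducer that accumulates corpus-wide token counts after stopword removal. The two functions are repeated so that the example is self-contained, and the toy documents are hypothetical.

```python
from functools import reduce

def make_stopwords_remover(stopwords):
    def stopwords_remover(words):
        return [word for word in words if word not in stopwords]
    return stopwords_remover

def flatten_count(accumulator, items):
    for item in items:
        accumulator[item] = accumulator.get(item, 0) + 1
    return accumulator

# Hypothetical toy documents, already tokenized
toy_docs = [["le", "chat", "dort"], ["le", "chien", "et", "le", "chat"]]

remove_stopwords = make_stopwords_remover({"le", "et"})
filtered_docs = [remove_stopwords(doc) for doc in toy_docs]

# Fold flatten_count over all documents, starting from an empty dict
word_counts = reduce(flatten_count, filtered_docs, {})
print(word_counts)
```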

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

french_stopwords = set(stopwords.words("french")).union(CUSTOM_STOPWORDS).union(other_stopwords)
df["tokenized"] = df["paragraph"].str.lower().map(word_tokenize)
df["tokenized"] = df["tokenized"].map(make_stopwords_remover(french_stopwords))
docs = df["tokenized"].values

In [None]:
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)
dictionary.filter_extremes(no_below=20, no_above=0.5)

corpus = [dictionary.doc2bow(doc) for doc in docs]

from gensim.models import LdaModel

# Set training parameters.
num_topics = 50
chunksize = 2000
passes = 20
iterations = 400
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

In [None]:
# Each entry of top_topics is a (topic, coherence) pair, where topic is a list of (probability, word) pairs.
top_topics = model.top_topics(corpus)

# Keep only the words of each topic.
top_words_topics = [[word for _, word in topic] for topic, _ in top_topics]

In [None]:
import numpy as np

# One row per document, one column per topic; topics not returned by the model keep probability 0.
documents_embedding = np.zeros((len(corpus), num_topics))
for i, bow in enumerate(corpus):
    for t, p in model[bow]:
        documents_embedding[i, t] = p

from sklearn.manifold import TSNE
# Plot at most 4000 randomly chosen documents.
sample_size = min(4000, documents_embedding.shape[0])
sample = np.random.choice(documents_embedding.shape[0], size=sample_size, replace=False)
documents_embedding_to_plot = documents_embedding[sample]

tsne = TSNE(n_components=2)
embeddings = tsne.fit_transform(documents_embedding_to_plot)


In [None]:
import matplotlib.pyplot as plt

authors = df['author'].iloc[sample].values
authors_list = ['Balzac', 'Daudet', 'Dumas', 'Hugo', 'Flaubert', 'Maupassant', 'Stendhal', 'Verne', 'Vigny', 'Zola']
colors = ['red', 'blue', 'green', 'orange', 'yellow', 'magenta', 'chartreuse', 'gold', 'lightsalmon', 'cyan']

to_plot = authors_list

plt.figure(figsize=[12,12])
for i,author in enumerate(authors_list):
    if author in to_plot:
        plt.scatter(embeddings[authors==author,0], embeddings[authors==author,1], color=colors[i], label=author, alpha=0.5)
plt.legend()
plt.show()

## Predictions

Building on the previous analysis, we can build a predictive pipeline that learns to classify documents by author.

In [18]:
# Load data

import problem

X_train, y_train = problem.get_train_data(sep='|')
X_test, y_test = problem.get_test_data(sep='|')

### Feature extractor

In [19]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer

class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = TfidfVectorizer(strip_accents='ascii',
                                          max_df=0.7)

    def fit(self, X_df, y=None):
        self.vectorizer.fit(X_df['paragraph'])
        return self

    def transform(self, X_df):
        X_preprocessed = self.vectorizer.transform(X_df['paragraph'])
        return X_preprocessed
    
feature_extractor = FeatureExtractor()

### Classifier

In [20]:
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression

class Classifier(BaseEstimator):
    def __init__(self):
        self.classifier = LogisticRegression(solver='lbfgs', max_iter=1000,
                                             multi_class='multinomial')

    def fit(self, X, y):
        self.classifier.fit(X, y)
        return self

    def predict(self, X):
        y_pred = self.classifier.predict(X).astype(int)
        return y_pred

    def predict_proba(self, X):
        proba_pred = self.classifier.predict_proba(X)
        return proba_pred
    
classifier = Classifier()

### Score metric

In [24]:
from rampwf.score_types.classifier_base import ClassifierBaseScoreType
from sklearn.metrics import f1_score

class F1Score(ClassifierBaseScoreType):
    is_lower_the_better = False
    minimum = 0.0
    maximum = 1

    def __init__(self, name="F1-score", precision=2):
        self.name = name
        self.precision = precision

    def __call__(self, y_true, y_pred):
        return f1_score(y_true, y_pred, average='micro')
    
scorer = F1Score()
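The scorer above uses `average='micro'`. In a single-label multiclass problem such as this one, micro-averaged F1 coincides with plain accuracy, because every misclassified item counts exactly once as a false positive (for the predicted class) and once as a false negative (for the true class). The sketch below checks this on hypothetical toy labels with a hand-written reimplementation; `micro_f1` and `accuracy` are illustrative helpers, not scikit-learn functions.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            tp += (p == label and t == label)
            fp += (p == label and t != label)
            fn += (p != label and t == label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def accuracy(y_true, y_pred):
    """Fraction of items whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy labels (integer-encoded authors)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

print(micro_f1(y_true, y_pred), accuracy(y_true, y_pred))
```

If the balance between authors matters, `f1_score` also accepts `average='macro'`, which averages the per-class F1 scores instead; whether that suits the challenge better is a design decision left open here.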

### Score on test data

In [27]:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

clf = Pipeline(steps=[
    ('feature_extractor', feature_extractor),
    ('classifier', classifier)])

clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

train_score = scorer(y_true=y_train, y_pred=y_train_pred)
test_score = scorer(y_true=y_test, y_pred=y_test_pred)

print('F1 score on train set : ', train_score)
print('F1 score on test set : ', test_score)

F1 score on train set :  0.9135629846192957
F1 score on test set :  0.487004986799648


## Ramp workflow

### Submission structure

Each submission should be in its own folder within the `submissions` folder (e.g. `submissions/my_submission`). The submission directory should contain two files:

* `feature_extractor.py` - this should implement a feature extractor with `fit()` and `transform()` methods
* `classifier.py` - this should implement a classifier with `fit()` and `predict()` methods

See `submissions/starting_kit` for an example.
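A convenient way to start a new submission is to copy the starting kit and edit the copy. The sketch below is one way to do that with the standard library; `create_submission` and `REQUIRED_FILES` are hypothetical helpers written for this example and are not part of `ramp-workflow`.

```python
import shutil
from pathlib import Path

# The two files listed above
REQUIRED_FILES = ("feature_extractor.py", "classifier.py")

def create_submission(name, root="submissions", template="starting_kit"):
    """Copy <root>/<template> to <root>/<name> and check that the expected files exist."""
    source = Path(root) / template
    target = Path(root) / name
    shutil.copytree(source, target)  # raises FileExistsError if the target already exists
    missing = [f for f in REQUIRED_FILES if not (target / f).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files in {target}: {missing}")
    return target
```

Calling `create_submission("my_submission")` from the root of the kit would then create `submissions/my_submission`, assuming `submissions/starting_kit` is present.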

### Local testing (before submission)

The `ramp-workflow` library provides a unit test, `ramp_test_submission`, to check whether a submission works locally before submitting it to the server. By default, this command tests the files in [`submissions/starting_kit`](/submissions/starting_kit). To test a different folder, use the `--submission` flag. For example, to run the test on `submissions/solution1`, use: `ramp_test_submission --submission solution1`.

In [33]:
!ramp_test_submission --submission starting_kit

Testing Who wrote this? Predicting the author of a paragraph
Reading train and test files from ./data ...
Reading cv ...
Training submissions/starting_kit ...
CV fold 0
	score  F1-score       time
	train      0.91  46.961329
	valid      0.82   3.172486
	test       0.48   1.174487
CV fold 1
	score  F1-score       time
	train      0.91  44.838507
	valid      0.81   3.544223
	test       0.49   1.447141
CV fold 2
	score  F1-score       time
	train      0.91  49

## More information

You can find more information in the README of the [ramp-workflow](https://github.com/paris-saclay-cds/ramp-workflow) library.