## Chapter 4: Textual Similarity

* Common Applications
    * Document Retrieval
    * Topic Labeling
    * Authorship Analysis

# 4.1: The Problem
* Scenario: Authorship analysis - who wrote this document?
    * Scenario: Authorship attribution
    * Scenario: Authorship verification

# 4.2: The Data
* Data:
    * The PAN network public data
    * Authorship attribution:
        * Text snippets representing one author
    * Authorship verification:
        * Create pairs of documents written either by the same author or by two different authors
        * How can we predict whether two documents belong to the same author?
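
The pairing step for authorship verification can be sketched as follows. This is a minimal illustration, not the PAN data format: the `docs_by_author` dictionary, the `make_pairs` helper, and the 1/0 label convention are assumptions made for the example.

```python
from itertools import combinations

def make_pairs(docs_by_author):
    """Return (doc_a, doc_b, label) triples.

    label is 1 when both documents share an author, 0 otherwise.
    docs_by_author is a hypothetical {author: [documents]} mapping.
    """
    labelled = [(author, doc)
                for author, docs in docs_by_author.items()
                for doc in docs]
    pairs = []
    for (author_a, doc_a), (author_b, doc_b) in combinations(labelled, 2):
        pairs.append((doc_a, doc_b, 1 if author_a == author_b else 0))
    return pairs

# Toy corpus, invented for illustration only.
corpus = {
    "author_1": ["text a1", "text a2"],
    "author_2": ["text b1"],
}
pairs = make_pairs(corpus)
for doc_a, doc_b, label in pairs:
    print(label, doc_a, "|", doc_b)
```

Pairing every document with every other one produces far more different-author pairs than same-author pairs as the corpus grows, so in practice the negative pairs would probably need to be subsampled.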

# 4.3: Data Representation
* How can we derive a textual profile such that we can compare new texts with unknown authorship to known texts?
    * We must determine the author style
        * Word choice
        * Word order and other grammatical choices
        * Typos and abbreviations
        * The use of adjectives/adverbs for sentiment
* Representing this information
    * Document -> Segmented Document -> Feature extraction -> Vectorization -> Vectors / Segment

## 4.3.1: Segmenting documents
* For long documents, split the text into fixed-size blocks of a set number of words, then represent each block as a vector and group the vectors per document.
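
A minimal sketch of fixed-size segmentation, assuming whitespace tokenization. The `segment_words` helper and the block size of 4 are illustrative choices, not values from the chapter.

```python
def segment_words(text, block_size):
    """Split text into consecutive blocks of at most block_size words."""
    words = text.split()  # assumption: plain whitespace tokenization
    return [words[i:i + block_size]
            for i in range(0, len(words), block_size)]

blocks = segment_words("the quick brown fox jumps over the lazy dog", 4)
for block in blocks:
    print(block)
```

The final block may be shorter than `block_size`. Whether to pad it or drop it is a design choice this sketch leaves open.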

## 4.3.2: Word-level information
* Hashing Trick -> Creates vectors of a specified length
* Option 1: 
    * Vectorize the documents using bag-of-words
    * Segment the documents using word n-grams
* Option 2: 
    * Vectorize the documents using character n-grams
    * Segment the documents based on character n-grams
* If we preprocess our data explicitly, a CNN should be able to detect the higher-order n-grams.
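
The two n-gram options and the hashing trick can be sketched in plain Python. This is an illustration under assumptions: the helper names are hypothetical, `zlib.crc32` stands in for whatever hash function a real library would use, and the vector length of 8 is arbitrary.

```python
import zlib

def word_ngrams(text, n):
    """Return the word n-grams of a whitespace-tokenized text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Return the character n-grams of a text, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def hash_vectorize(features, dim):
    """Hashing trick: count each feature in bucket hash(feature) % dim."""
    vec = [0] * dim
    for feature in features:
        bucket = zlib.crc32(feature.encode("utf-8")) % dim
        vec[bucket] += 1
    return vec

bigrams = word_ngrams("to be or not", 2)
trigrams = char_ngrams("abcd", 3)
vector = hash_vectorize(char_ngrams("to be or not", 3), 8)
print(bigrams, trigrams, vector)
```

The vector length stays fixed no matter how many distinct n-grams occur. The cost is that unrelated n-grams can collide in the same bucket.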

## 4.3.3: Subword-level information

# 4.4: Models for Measuring Similarity

## 4.4.1: Authorship Attribution
* Goal: Let's train a deep multilayer perceptron + CNN on the authorship attribution task

## Multilayer Perceptron
* Train a classifier MLP on the authors in the PAN dataset for task A.
    * Embedding connects to a Dense layer
    * Dense layer feeds into a Dropout layer
        * Dropout layer randomly deactivates neurons to reduce overfitting
* We get OK accuracy for this, but placing more and more character limits on the n-grams causes us to lose valuable lexical information.
* Altering the model to the explicit word n-grams approach leads to our lexicon exploding exponentially!
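
To see why explicit word n-grams inflate the lexicon, compare the number of distinct n-grams in a text with the theoretical ceiling of `V ** n` for a vocabulary of `V` words. This is a toy sketch: the sentence and the `ngram_vocabulary` helper are made up for illustration, and real corpora fall well below the ceiling while still growing quickly.

```python
def ngram_vocabulary(tokens, n):
    """Return the set of distinct word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Toy text, invented for illustration only.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
unigram_count = len(set(tokens))

for n in (1, 2, 3):
    observed = len(ngram_vocabulary(tokens, n))
    upper_bound = unigram_count ** n  # every possible combination of n words
    print(f"n={n}: observed={observed}, upper bound={upper_bound}")
```

Every distinct n-gram needs its own index in the embedding or input layer, so the ceiling matters for model size even when most combinations never occur.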

## CNNs for Text
* Authorship is expressed by many features scattered across a document. 
    * CNNs are good at picking up these features!
* Train a CNN on the authors in the PAN dataset for task A.
    * Embedding connects to a Dense layer
    * Dense layer feeds into a Convolution1D layer
* Certain choices for feature representation have a direct bearing on model complexity + resource demands!
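
The reason a CNN can pick up features scattered across a document is that a filter slides over every position and pooling keeps its strongest response. Below is a minimal pure-Python sketch of a single 1D convolution filter with ReLU and global max pooling. The two-dimensional "embeddings" and the hand-set kernel are invented for illustration; a trained network would learn these values.

```python
def conv1d(sequence, kernel):
    """Slide one filter over a sequence of vectors and apply ReLU.

    sequence: list of embedding vectors, one per token.
    kernel:   list of weight vectors, one per position in the window.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(x * w
                    for vec, kvec in zip(window, kernel)
                    for x, w in zip(vec, kvec))
        outputs.append(max(0.0, total))
    return outputs

def global_max_pool(values):
    """Keep only the strongest filter response, wherever it occurred."""
    return max(values)

A, B = [1.0, 0.0], [0.0, 1.0]   # two toy token embeddings
kernel = [A, B]                 # responds most strongly to "A then B"

early = conv1d([A, B, A, A], kernel)  # pattern at the start
late = conv1d([A, A, A, B], kernel)   # same pattern at the end
print(early, global_max_pool(early))
print(late, global_max_pool(late))
```

Both sequences give the same pooled value, so the filter acts like a learned bigram detector that does not care where in the text the pattern appears.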

## 4.4.2: Authorship Verification
* Siamese networks are networks with (usually 2) sister subnetworks + an arbiter
    * One single model combines the two input layers such that they share embeddings and processing!
    * The two sister networks share the same weights!
    * The latent representations are used by the arbiter
        * The arbiter computes a contrastive loss between the representations
        * The overall network learns a threshold to decide whether the measured distance indicates similarity
            * Distance is measured based on an exponentiated negative sum of differences
    * The network trains on pairs of similar or dissimilar texts.
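
The pieces of a Siamese setup can be sketched in plain Python. Everything here is an assumption made for illustration: the one-layer `encode` function stands in for the sister subnetworks, the weights are fixed by hand rather than learned, and `contrastive_loss` uses one common formulation on a raw distance, which may differ from the exact loss used in the chapter.

```python
import math

def encode(vector, weights):
    """Shared sister subnetwork: one linear layer followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector)))
            for row in weights]

def manhattan(a, b):
    """Sum of absolute differences between two representations."""
    return sum(abs(x - y) for x, y in zip(a, b))

def exp_neg_manhattan(a, b):
    """Exponentiated negative sum of differences: 1.0 means identical."""
    return math.exp(-manhattan(a, b))

def contrastive_loss(distance, same_author, margin=1.0):
    """Pull same-author pairs together, push others beyond the margin."""
    if same_author:
        return distance ** 2
    return max(margin - distance, 0.0) ** 2

weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical shared weights
left = encode([0.5, -0.5], weights)
right = encode([0.5, -0.5], weights)   # same weights for both inputs
score = exp_neg_manhattan(left, right)
print(score, contrastive_loss(manhattan(left, right), same_author=True))
```

Because both inputs pass through the same `weights`, identical texts always map to identical representations and score 1.0. A decision threshold on the score then separates same-author pairs from different-author pairs.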

# 4.5: Summary
* Textual similarity use cases include authorship attribution and authorship verification
* Representational choices can have hefty ramifications for model complexity. Some models may bypass some of that burden through their intrinsic organization (like CNNs compared to MLPs).
* Siamese networks can be used to determine textual similarity.
* Lexical information alone is not sufficient for establishing textual similarity. Style is expressed both lexically and formally, in terms of word combinations.
* CNNs are good at emphasizing sequential information. They outperformed a simple MLP in our authorship attribution experiments.
* Siamese networks can be used for textual similarity in authorship verification.