# Texts as Vectors

## Textual Similarity

* This week we will discuss different ways of representing documents and words as vectors.

* Document vectors are useful for comparing one document to another, e.g. in document retrieval.

* In stylometry: “The fundamental assumption in authorship attribution is that individuals have idiosyncratic and largely unconscious habits of language use, leading to stylistic similarities between texts written by the same person.” (Evert et al. 2017)

* Well-known attribution cases: the Federalist Papers, J. K. Rowling, etc.

## Preprocessing

* How many tokens are in the corpus?

* Tokenization: if you are looking for “cat”, you want “Cat” and “cat.”, but not “scatter” or “eradicate”.

* Lemmatization: reduce each token to its dictionary form, e.g. cat => cat; cats => cat.

* Then find all unique tokens in the corpus (the collection of documents).
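The preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration, not a full pipeline: the `tokenize` helper and the two example sentences are made up for this sketch, the regular expression keeps only the letters a to z, and lemmatization is left out because it normally needs a dedicated library.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    # Assumption: only a-z counts as a letter; accented characters are dropped.
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical two-document corpus, chosen to include the tricky cases.
corpus = ["The Cat sat.", "Cats scatter; eradicate the cat."]
tokenized = [tokenize(doc) for doc in corpus]

# Matching whole tokens finds "Cat" and "cat." but not "scatter" or "eradicate".
cat_hits = sum(tokens.count("cat") for tokens in tokenized)

# How many tokens are in the corpus, and which unique tokens occur?
total_tokens = sum(len(tokens) for tokens in tokenized)
vocabulary = sorted({token for tokens in tokenized for token in tokens})

print(cat_hits)      # 2
print(total_tokens)  # 8
print(vocabulary)    # ['cat', 'cats', 'eradicate', 'sat', 'scatter', 'the']
```

Note that "cats" stays a separate token here; a lemmatizer would map it to "cat".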

## Term frequency

* Document 1: “Dog dog dog cat rat.”
* Document 2: “Cat cat dog dog mouse mouse.”
* Unique tokens: [“dog”, “cat”, “rat”, “mouse”]
* Mapping: {0: “dog”, 1: “cat”, 2: “rat”, 3: “mouse”}
* Vector 1: [3, 1, 1, 0]
* Vector 2: [2, 2, 0, 2]
* Each document is a vector in n-dimensional space, where n is the number of unique terms. 
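As a rough sketch, the two vectors above can be reproduced in a few lines. The helper names `tokenize` and `count_vector` are invented for this example, and the vocabulary is ordered by first appearance so that it matches the mapping on this slide.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

documents = ["Dog dog dog cat rat.", "Cat cat dog dog mouse mouse."]
tokenized = [tokenize(doc) for doc in documents]

# Unique tokens in order of first appearance (dict keys keep insertion order).
vocabulary = list(dict.fromkeys(t for tokens in tokenized for t in tokens))

def count_vector(tokens, vocabulary):
    """Return the term counts of one document, one element per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vectors = [count_vector(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)  # ['dog', 'cat', 'rat', 'mouse']
print(vectors)     # [[3, 1, 1, 0], [2, 2, 0, 2]]
```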

## Vector similarity
![cats.png](attachment:cats.png)

* <span style="color:yellow; background-color:purple">Document 1: cat cat cat cat</span>
* <span style="color:green">Document 2: cat dog</span>
* <span style="color:blue">Document 3: dog dog cat dog</span>
* <span style="color:orange">Document 4: dog dog cat cat</span>
* <span style="color:white; background-color:purple">Document 5: cat cat cat dog</span>
* <span style="color:red">Document 6: dog dog dog dog</span>
* Doc 5 is the closest doc to doc 1.
* Doc 2 and doc 4 are “the same”: their vectors point in the same direction because the relative frequencies are identical.

## Cosine similarity

* Cosine of the angle between two vectors.

* Each vector represents one document.

* Each element of the vector represents one word.

* Value depends on how often the word appears (TF), or a different metric (e.g. TF-IDF).

* For vectors with no negative values (such as counts or TF-IDF weights), the result is a number between 0 and 1:
    * 0: No words in common at all.
    * 1: Same words, same relative frequencies.

* This automatically corrects for texts of different length.  **NB.** Most methods don't do this!  Usually it is crucial to normalize word frequencies by length of text!
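To make the definition concrete, here is a small sketch that applies cosine similarity to the six cat/dog documents from the Vector similarity slide. The `cosine_similarity` function is written out by hand for illustration, and returning 0.0 for an all-zero vector is an assumption of this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)

# (cat, dog) counts for Documents 1 to 6 from the Vector similarity slide.
docs = {1: (4, 0), 2: (1, 1), 3: (1, 3), 4: (2, 2), 5: (3, 1), 6: (0, 4)}

# Rank every other document by its similarity to Document 1.
others = [d for d in docs if d != 1]
closest_to_1 = max(others, key=lambda d: cosine_similarity(docs[1], docs[d]))

print(closest_to_1)  # 5
```

Documents 2 and 4 score 1.0 (up to floating-point rounding) even though Document 4 is twice as long, which is the length correction described above. Documents 1 and 6 share no words and score 0.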

## Cosine similarity (Wikipedia)
![cosine.png](attachment:cosine.png)

## Inverse Document Frequency

* Not all words are equally “important”.  E.g. “the”, “of”, etc. are very frequent in many documents.  A fixed stopword list is a one-size-fits-all fix; IDF adapts to the corpus.

* E.g. in documents related to “the fluffy cat”,  “fluffy” and “cat” are more significant/meaningful than “the”.

* Words occurring in fewer documents are more significant:

* DF = How many documents this term appears in

* N = Total number of documents

* IDF = N/DF, or more commonly log(N/DF).

* The larger DF is, the smaller IDF will be.
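The definitions above translate directly into code. The following sketch uses the log(N/DF) form with the natural logarithm; the choice of base and the four toy documents are assumptions made for illustration.

```python
import math

def inverse_document_frequency(tokenized_docs):
    """Return log(N / DF) for every term in a list of tokenized documents."""
    n_docs = len(tokenized_docs)
    document_frequency = {}
    for tokens in tokenized_docs:
        # set() ensures a term is counted once per document, not once per use.
        for term in set(tokens):
            document_frequency[term] = document_frequency.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in document_frequency.items()}

# Hypothetical corpus of four already-tokenized documents.
docs = [["the", "fluffy", "cat"], ["the", "dog"], ["the", "cat"], ["the", "rat"]]
idf = inverse_document_frequency(docs)

# "the" is in every document (IDF 0); "fluffy" is in one (largest IDF).
print(idf["the"] < idf["cat"] < idf["fluffy"])  # True
```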

## Example: IDF with 20 Documents
![idf.png](attachment:idf.png)

## TF-IDF

* TF-IDF is term frequency times inverse document frequency.

* A measure of the distinctiveness of a term in a corpus:  “times term appears” × “how rare is the term”.

* Measures relevance of a term to a document.

* Even where its TF is large, a word like “by” has a small IDF because it appears in almost every document, so it is not very relevant to most documents.

* Even where its TF is small, a word like “fluffy” has a large IDF because it appears in few documents, so it is very relevant to the documents it is in.
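The “by” versus “fluffy” contrast can be checked numerically. This is a sketch under simple assumptions: raw counts for TF, log(N/DF) with the natural logarithm for IDF, and three made-up sentences as the corpus. Many libraries use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical corpus; "by" occurs in every document, "fluffy" in only one.
docs = [
    "by the way the cat sat by the door by the wall".split(),
    "a fluffy cat slept by the fire".split(),
    "the dog ran by the river".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents does each term occur?
document_frequency = Counter(term for tokens in docs for term in set(tokens))

def tf_idf(tokens):
    """Weight each term in one document by count * log(N / DF)."""
    counts = Counter(tokens)
    return {
        term: count * math.log(n_docs / document_frequency[term])
        for term, count in counts.items()
    }

weights = [tf_idf(tokens) for tokens in docs]

print(weights[0]["by"])                      # 0.0, despite TF = 3
print(weights[1]["fluffy"] > weights[1]["by"])  # True, despite TF = 1
```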

## Cosine Similarity + TF-IDF

* Gives us a way of measuring document similarity.

* Highest score => strongest similarity.

* Which documents are most similar to any given document?

* How similar are they?

* Degree of similarity depends on the corpus.

* Document Frequency depends on the contents of all documents, not just any two we compare.
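The dependence on the corpus is easy to demonstrate: the same two documents get a different similarity score when the rest of the corpus changes. The sketch below assumes raw counts for TF and log(N/DF) for IDF; the helper names and toy corpora are invented for this example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one TF-IDF vector per document, over the sorted corpus vocabulary."""
    n_docs = len(docs)
    df = Counter(term for tokens in docs for term in set(tokens))
    vocabulary = sorted(df)
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

doc_a = "the fluffy cat".split()
doc_b = "the fluffy dog".split()

# The same pair of documents embedded in two different (hypothetical) corpora.
corpus_1 = [doc_a, doc_b, "the rat".split()]
corpus_2 = [doc_a, doc_b, "the fluffy rat".split(), "a fluffy mouse".split()]

vectors_1 = tfidf_vectors(corpus_1)
vectors_2 = tfidf_vectors(corpus_2)
sim_1 = cosine_similarity(vectors_1[0], vectors_1[1])
sim_2 = cosine_similarity(vectors_2[0], vectors_2[1])

print(sim_1 > sim_2)  # True
```

In the second corpus “fluffy” appears in every document, so it stops counting as evidence that the two documents are alike.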

## Visualizing Similarity
![heatmap.png](attachment:heatmap.png)

## Bag of Words

* “Bag of words” models ignore word order

* "Trump/tweeted/angrily/at/Biden" has identical document vector representation to "Biden/tweeted/angrily/at/Trump" even though these mean very different things.

* Sequences capture more information.

* Adjacent tokens capture reused fragments.

* E.g. the pair “Trump / tweeted” only occurs in the first sentence.

## N-grams

* n-grams: in-order collocations of n tokens, where n=1,2,3,...

* They capture features of word usage statistically:
    * Which words tend to go together? Which don’t?

* They do not attempt to deal directly with:
    * Semantics
    * Context 
    * Grammar

* They work for any natural language.

* Shared n-grams are suggestive of semantic similarity.
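A short sketch shows how n-grams recover the word-order information that a bag of words throws away. The `ngrams` helper is written for this example, and the two sentences are the ones from the Bag of Words slide, lowercased and split on spaces.

```python
def ngrams(tokens, n):
    """Return all in-order runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = "Trump tweeted angrily at Biden".lower().split()
second = "Biden tweeted angrily at Trump".lower().split()

# As bags of words (unigrams, order ignored) the sentences are identical.
print(sorted(first) == sorted(second))  # True

# As bigrams they differ: ("trump", "tweeted") occurs only in the first.
first_bigrams = ngrams(first, 2)
second_bigrams = ngrams(second, 2)
print(("trump", "tweeted") in first_bigrams)   # True
print(("trump", "tweeted") in second_bigrams)  # False
```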

## N-grams in Alice in Wonderland
![n-grams.png](attachment:n-grams.png)

## Similarity by shared n-grams
* Compute the n-grams in each document A & B
* How many of all n-grams are in both A & B?
![compare.png](attachment:compare.png)

## The Jaccard Index 
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for measuring the similarity and diversity of sample sets. For two sets A and B it is the size of their intersection divided by the size of their union.
![jaccard2.png](attachment:jaccard2.png)
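Applied to n-grams, the Jaccard index compares the set of n-grams in document A with the set in document B. The sketch below uses invented helper names, and treating two empty sets as identical (score 1.0) is a convention assumed here, not a universal rule.

```python
def ngram_set(tokens, n):
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

doc_a = "trump tweeted angrily at biden".split()
doc_b = "biden tweeted angrily at trump".split()

bigrams_a = ngram_set(doc_a, 2)
bigrams_b = ngram_set(doc_b, 2)

# 2 shared bigrams out of 6 distinct bigrams overall.
score = jaccard(bigrams_a, bigrams_b)
print(len(bigrams_a & bigrams_b), len(bigrams_a | bigrams_b))  # 2 6
```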

## Burrows’s Delta

* Take token counts and divide by length of document to get relative frequencies.

* Choose the n most frequent tokens across the corpus.

* Then normalize the frequencies across the corpus by transforming them into z-scores:

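* For each token, subtract its mean frequency across the corpus and divide by its standard deviation, so that every token has mean 0 and standard deviation 1. The z-score says how many standard deviations a document's frequency lies above or below that mean.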

* The delta score between two documents is the average of the absolute values of the differences between the z-scores for each token.
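The steps above can be sketched as follows. Several details are assumptions of this sketch rather than part of the definition on the slide: the most frequent tokens are picked by raw corpus counts, the population standard deviation (`pstdev`) is used, and a token with zero variance gets a z-score of 0. The three short documents are made up.

```python
from collections import Counter
from statistics import mean, pstdev

def relative_frequencies(tokens):
    """Token counts divided by the length of the document."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def z_score_vectors(tokenized_docs, n_terms):
    """Return the chosen terms and one vector of z-scores per document."""
    corpus_counts = Counter(t for tokens in tokenized_docs for t in tokens)
    terms = [term for term, _ in corpus_counts.most_common(n_terms)]
    freqs = [relative_frequencies(tokens) for tokens in tokenized_docs]
    vectors = [[] for _ in tokenized_docs]
    for term in terms:
        column = [f.get(term, 0.0) for f in freqs]
        mu, sigma = mean(column), pstdev(column)
        for vector, value in zip(vectors, column):
            # Assumption: a term with no variation contributes a z-score of 0.
            vector.append((value - mu) / sigma if sigma > 0 else 0.0)
    return terms, vectors

def burrows_delta(z_a, z_b):
    """Mean absolute difference between two vectors of z-scores."""
    return mean(abs(x - y) for x, y in zip(z_a, z_b))

# Hypothetical corpus: documents 0 and 1 are identical, document 2 differs.
docs = [
    "the cat and the dog and the rat".split(),
    "the cat and the dog and the rat".split(),
    "a dog of the town saw a cat of the city".split(),
]
terms, z = z_score_vectors(docs, 5)

print(burrows_delta(z[0], z[1]))      # 0.0
print(burrows_delta(z[0], z[2]) > 0)  # True
```

A smaller delta means more similar usage of the most frequent tokens, so unlike cosine similarity, lower scores indicate closer documents.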

## Burrows’s Delta

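* Alternatively, cosine Delta computes the cosine distance between vectors of z-scores, i.e. 1 minus their cosine similarity. Because z-scores can be negative, the cosine similarity here ranges from -1 to 1.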

* Using z-scores has the opposite effect of IDF: it reduces the influence of outliers. It identifies patterns of using certain words more or less than the mean. 

* This is useful for authorship attribution.

* At this point, we can visualize the results directly, or feed the numbers into a supervised classification algorithm or an unsupervised clustering algorithm.
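For the cosine Delta variant mentioned above, a minimal sketch is given below. It takes cosine distance to mean 1 minus cosine similarity, and the z-score vectors are hand-picked illustrative values rather than results from a real corpus.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norms == 0:
        return 1.0  # assumption: treat an all-zero vector as unrelated
    return 1.0 - dot / norms

# Hypothetical z-score vectors for three documents over three tokens.
z_a = [1.0, -1.0, 0.5]
z_b = [1.0, -1.0, 0.5]    # same pattern of over- and under-use as z_a
z_c = [-1.0, 1.0, -0.5]   # exactly the opposite pattern

same = cosine_distance(z_a, z_b)      # similarity 1, so distance 0
opposite = cosine_distance(z_a, z_c)  # similarity -1, so distance 2
print(same < opposite)  # True
```

A table of such pairwise distances is the kind of input that could then be visualized or passed to a clustering algorithm.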