Natural Language Processing (NLP) is a field at the intersection of computer science, linguistics and machine learning that is concerned with the processing of natural language.

NLP is all about enabling computers to understand and generate human language. 

NLP is generally divided into the following fields:

 * **Speech Recognition** - The translation of spoken language into text
 * **Natural Language Understanding** - A computer's ability to understand text
 * **Natural Language Generation** - The automatic generation of natural language

 
## Base techniques of text processing

### Parsing

Parsing refers to the formal analysis of a sentence by a computer into its component parts.

This results in a parse tree that shows the syntactic relation of the parts to each other.

Here is an example of a parse tree for the sentence: "The thief robbed the apartment":

<img src="images/parse-tree.jpg" height="350" width="500"/> 

The words are annotated with part of speech tags: noun, verb, determiner etc.

Subsequences of words are grouped together: 

 * "the thief" is a noun phrase
 * "robbed the apartment" is a verb phrase
 * the whole structure forms a sentence
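
The grouping above can be sketched as a nested data structure. This is only an illustration: the tag names (`S`, `NP`, `VP`, `DET`, `N`, `V`) and the helper functions are assumptions chosen for this example and may differ from the labels used in the image.

```python
# Parse tree for "The thief robbed the apartment" as nested tuples:
# (label, child, child, ...) for phrases and (pos_tag, word) for leaves.
tree = (
    "S",
    ("NP", ("DET", "The"), ("N", "thief")),
    ("VP",
        ("V", "robbed"),
        ("NP", ("DET", "the"), ("N", "apartment"))),
)


def is_leaf(node):
    """A leaf is a (pos_tag, word) pair whose second element is a string."""
    return len(node) == 2 and isinstance(node[1], str)


def words(node):
    """Return the words covered by a node, in sentence order."""
    if is_leaf(node):
        return [node[1]]
    result = []
    for child in node[1:]:
        result.extend(words(child))
    return result


def phrases(node):
    """Yield (label, text) for every non-leaf node in the tree."""
    if is_leaf(node):
        return
    yield node[0], " ".join(words(node))
    for child in node[1:]:
        yield from phrases(child)


for label, text in phrases(tree):
    print(label, "->", text)
```

Walking the tree from the root lists the whole sentence first, followed by the noun phrase and the verb phrase it consists of.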

### Text segmentation

Text segmentation is the process of detecting word and sentence boundaries. 

Detecting word boundaries by splitting at whitespace is not reliable. Take for example the variants "ice box" and "ice-box", which denote the same thing but yield a different number of tokens.

Detecting sentence boundaries based on punctuation marks (e.g. ".", "!", "?") is likewise not reliable, because a period also occurs in abbreviations and decimal numbers.
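
A small sketch of why the naive approaches fall short. The example sentence and the regular expression are assumptions made up for this illustration, not a recommended segmenter.

```python
import re

text = "Dr. Smith bought an ice-box. It cost $3.50!"

# Naive word segmentation: split at whitespace.
# Punctuation stays glued to the words ("ice-box.", "$3.50!").
naive_words = text.split()

# Naive sentence segmentation: split at whitespace that follows ".", "!" or "?".
# The abbreviation "Dr." is wrongly treated as the end of a sentence.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(naive_words)
print(naive_sentences)
```

The text contains two sentences, but the naive splitter returns three pieces because it cuts after "Dr.".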

### Vocabulary

The vocabulary is a mapping of words to integer IDs. Normally only words contained in the vocabulary are used for model training or predictions. 

So-called `stop words` are usually removed from the vocabulary. Stop words are words that do not contain significant information, for example "the", "a" or "of".

Often the vocabulary also maintains a count of how many documents each word occurs in. This makes it possible to remove the most frequent and the least frequent words. The most frequent words appear in many documents and therefore do not carry much information. The least frequent words usually occur too rarely for a model to learn much from them.
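
A minimal sketch of building such a vocabulary with document counts. The corpus, the stop-word list, and the names `build_vocabulary`, `min_df` and `max_df` are assumptions for illustration; words are obtained by a plain whitespace split to keep the example short.

```python
from collections import Counter

# Toy corpus and stop-word list; both are made up for illustration.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
stop_words = {"the", "on"}


def build_vocabulary(docs, stop_words, min_df=1, max_df=None):
    """Map words to integer IDs, keeping words whose document frequency
    lies in [min_df, max_df] and that are not stop words."""
    if max_df is None:
        max_df = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # set() so that each word is counted at most once per document
        doc_freq.update(set(doc.split()))
    kept = sorted(
        word for word, df in doc_freq.items()
        if word not in stop_words and min_df <= df <= max_df
    )
    vocabulary = {word: idx for idx, word in enumerate(kept)}
    return vocabulary, doc_freq


vocabulary, doc_freq = build_vocabulary(documents, stop_words, min_df=2)
print(vocabulary)
```

With `min_df=2`, words that occur in only one document ("mat", "log", "chased") are dropped, and lowering `max_df` would drop the most frequent words in the same way.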

### Lemmatization

Lemmatization is the process of reducing words to their base form. 

For example "touch" is the base form of "touching", "touched" etc.

Lemmatization is useful because we encounter different variations of words that actually have the same base form and meaning.

Lemmatization helps to reduce the size of our working vocabulary while losing little information. Typical size reductions are by a factor of 2 or 3.

In general reducing the amount of data reduces processing time. For some machine learning problems lemmatization helps to focus the training on a smaller number of classes.

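Lemmatization is related to, but not the same as, stemming. Stemming strips suffixes with simple rules and may produce strings that are not actual words, whereas lemmatization returns the dictionary form of a word.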
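
The effect on the vocabulary size can be sketched with a toy lookup table. The table entries and the function name `lemmatize` are assumptions for illustration; real lemmatizers rely on much larger dictionaries and on morphological analysis.

```python
# Toy lemma table, made up for illustration.
LEMMAS = {
    "touching": "touch",
    "touched": "touch",
    "touches": "touch",
    "apartments": "apartment",
}


def lemmatize(word):
    """Return the base form of a word, or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)


tokens = ["Touching", "touched", "touch", "apartments", "apartment"]
lemmas = [lemmatize(t) for t in tokens]

print(lemmas)
# Number of distinct words before and after lemmatization
print(len(set(t.lower() for t in tokens)), "->", len(set(lemmas)))
```

In this toy example five distinct word forms collapse into two base forms.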

### Named Entity Recognition

Named Entity Recognition (NER) annotates parts of text with pre-defined categories like:

 * person
 * location (city, country etc.)
 * organization
 * numeric values (monetary value, percentage, date etc.)
 
NER adds semantic information to a text dataset.

NER also helps to detect words that belong together and to fuse them. For example, if the consecutive words "New" and "York" are both annotated with CITY, they should be fused to "New York" and included as a single word in the vocabulary.
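
The fusing step can be sketched as follows. The input format (a list of word/tag pairs), the tag names (`PERSON`, `CITY`, and `O` for "no entity") and the function `fuse_entities` are assumptions for this example, not the output format of a particular NER tool.

```python
# Tokens with entity tags as an NER step might produce them.
tagged = [
    ("Alice", "PERSON"),
    ("moved", "O"),
    ("to", "O"),
    ("New", "CITY"),
    ("York", "CITY"),
    ("yesterday", "O"),
]


def fuse_entities(tagged_tokens, outside_tag="O"):
    """Merge runs of consecutive tokens that share the same entity tag."""
    fused = []
    for word, tag in tagged_tokens:
        if fused and tag != outside_tag and fused[-1][1] == tag:
            previous_word, _ = fused[-1]
            fused[-1] = (previous_word + " " + word, tag)
        else:
            fused.append((word, tag))
    return fused


print(fuse_entities(tagged))
```

This simple rule would also merge two different entities of the same type that happen to stand next to each other. Tagging schemes that mark the beginning of each entity avoid that ambiguity.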

### Relationship extraction

Relationship Extraction takes the named entities that result from NER and tries to identify the semantic relationships between them. 

This could mean, for example:

 * who is married to whom
 * a person works for a specific company
 * a user likes or dislikes a specific product


## Machine learning tasks for NLP

Some examples of NLP problems that are solved using machine learning:

 * Part-of-speech tagging - Assign a POS tag to each token in a document
 * Named entity extraction - Assign Named Entity tags to each token in the document
 * Machine translation - Translate a text written in one language into one or more other languages
 * Document clustering - Group documents by some criteria, e.g. similar content, documents that are used together etc.
 * Document classification - Assign a category to a document
 * Text summarization - Create a short summary for a text of variable length 
 * Keyword tagging - Assign a set of tags to a document, e.g. to support document organization or indexing

## Text representation

When we feed text to a machine learning model it must be encoded as a **numeric tensor**.

In any case we will have a **vocabulary** that maps words to word IDs. 

The vocabulary is usually a by-product of preprocessing the document dataset. 

One interesting aspect of a text representation is whether it is capable of capturing **word semantics**.

Examples of word semantics:

 * synonyms: strong, powerful, mighty
 * hierarchies: company, department, team
 * word classes like color: blue, red, green
 * semantic relations like country/language: France/French, Spain/Spanish etc.
 
### One-hot encoding

Words are encoded as one-hot vectors:

 * The dimensionality of a one-hot vector equals the number of words in the vocabulary
 * The vector contains only zeros, except at the index that represents the word, which is set to 1

A document is represented by the union (element-wise OR) of the one-hot vectors of the words it contains.

In general one-hot encoding is a poor representation:

 * one-hot vectors are sparse, so there is little overlap between different documents
 * it does not capture semantics
 * the ordering of words is lost
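
A minimal sketch of one-hot encoding for words and documents. The vocabulary and the function names `one_hot` and `encode_document` are assumptions for illustration.

```python
# Toy vocabulary; the words and IDs are made up for illustration.
vocabulary = {"cat": 0, "dog": 1, "mat": 2, "sat": 3}


def one_hot(word, vocabulary):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary[word]] = 1
    return vector


def encode_document(words, vocabulary):
    """Element-wise union (logical OR) of the one-hot vectors of all
    in-vocabulary words; out-of-vocabulary words are ignored."""
    vector = [0] * len(vocabulary)
    for word in words:
        if word in vocabulary:
            vector = [a | b for a, b in zip(vector, one_hot(word, vocabulary))]
    return vector


print(one_hot("dog", vocabulary))
print(encode_document("the cat sat".split(), vocabulary))
print(encode_document("sat the cat".split(), vocabulary))
```

The last two calls return the same vector, which illustrates that the word order is lost.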

 
### TFIDF

[Term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (TFIDF) is a score that reflects how important a word is to a document in a collection.

The TFIDF value increases with the number of times a word appears in the document.

It decreases with the number of documents in the corpus that contain the word.

TFIDF is based on the following assumptions:

 * A term that appears often in a document is probably more important for identifying it than a term that appears rarely.
 * A term that appears in many documents is probably less relevant for telling them apart.

A document is represented by a vector that contains the TFIDF values for all words in the vocabulary.
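
A minimal sketch of computing such vectors. It assumes one common weighting variant, term frequency `count / document length` multiplied by inverse document frequency `log(N / df)`; other variants exist. The corpus and the function name `tfidf_vectors` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
documents = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]


def tfidf_vectors(docs):
    """Return the sorted vocabulary and one TFIDF vector per document.

    Uses tf = count / document length and idf = log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # set() so that each word is counted at most once per document
        doc_freq.update(set(doc))
    vocabulary = sorted(doc_freq)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors


vocabulary, vectors = tfidf_vectors(documents)
for word, score in zip(vocabulary, vectors[0]):
    print(f"{word:8s}{score:.3f}")
```

In the first document, "the" gets a score of zero because it occurs in every document, while "mat" scores higher than "cat" because it occurs in fewer documents.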

TFIDF is in general a better representation than one-hot encoding but still has the same problems:

 * vectors are very large and sparse
 * TFIDF is based on frequency counts and does not capture semantics
 * word order is lost

### Word embeddings

Word embeddings represent words as fixed-size vectors, typically of 100 to 300 dimensions.

Words with similar meaning are mapped to similar positions in the vector space, i.e. they have a smaller distance to each other than words that are less similar.

Word embeddings are also able to encode other semantic relationships between words.

There are two options to represent a document: 
 1. The word vectors are stored as 2D tensors. The first axis is the word position and the second axis is the embedding dimension.
 2. Average the word vectors
 
Both representations usually only work for short documents. 
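
A small sketch of word similarity and of the two document representations. The three-dimensional vectors are hand-made for illustration only, since real embeddings are learned from data. Cosine similarity is used here as one common choice of similarity measure, and the function names are assumptions.

```python
import math

# Hand-made 3-dimensional vectors for illustration only.
embeddings = {
    "strong":   [0.9, 0.1, 0.0],
    "powerful": [0.8, 0.2, 0.1],
    "blue":     [0.0, 0.9, 0.4],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def as_sequence(words):
    """Option 1: 2D tensor of shape (number of words, embedding dimension)."""
    return [embeddings[w] for w in words]


def as_average(words):
    """Option 2: a single vector, the element-wise mean of the word vectors."""
    rows = as_sequence(words)
    return [sum(column) / len(rows) for column in zip(*rows)]


print(cosine_similarity(embeddings["strong"], embeddings["powerful"]))
print(cosine_similarity(embeddings["strong"], embeddings["blue"]))
print(as_sequence(["strong", "blue"]))
print(as_average(["strong", "powerful"]))
```

With these toy vectors, "strong" is much closer to "powerful" than to "blue". The sequence representation grows with the document length, while the average always has the embedding dimension but discards word order.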

### Document embeddings

Document embeddings represent whole documents as fixed-size vectors, typically of 100 to 300 dimensions.

Documents that have similar content are mapped to similar positions in the vector space.
