<a href="https://colab.research.google.com/github/brendenwest/ad450/blob/master/10_natural_language_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

### Reading
- https://towardsdatascience.com/gentle-start-to-natural-language-processing-using-python-6e46c07addf3
- https://towardsdatascience.com/introduction-to-natural-language-processing-for-text-df845750fb63

### Tutorials
- https://spacy.io/usage/spacy-101
- https://realpython.com/natural-language-processing-spacy-python/
- https://www.datacamp.com/community/tutorials/stemming-lemmatization-python
- https://www.datacamp.com/community/tutorials/machine-learning-hotel-reviews
- https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python

### Practice
- https://learn.datacamp.com/courses/introduction-to-natural-language-processing-in-python

### Reference
- [spaCy](https://spacy.io/usage/spacy-101)
- [NLTK](https://www.nltk.org/)
- [regular expressions](https://digitalfortress.tech/tricks/top-15-commonly-used-regex/)


### Learning Outcomes
- what is natural language processing
- common Python libraries
- word tokenization
- stemming & lemmatization
- n-grams
- text classification & sentiment analysis
- word vectors


# Overview

Natural Language Processing (NLP) applies data analysis and machine learning techniques to understanding text and speech.

Some common use cases for NLP are:
- word frequency analysis
- sentiment analysis
- plagiarism detection
- text-to-speech conversion
- machine translation
- text generation (chatbots)

## Common libraries

The Python ecosystem has two widely used NLP libraries that provide proven solutions for routine text-processing tasks:

- [Natural Language ToolKit (NLTK)](https://www.nltk.org/) - a leading platform for Python NLP programming, with interfaces to many lexical resources.
- [spaCy](https://spacy.io/) - a powerful, open-source library for advanced NLP, designed specifically for production use.

Either library is a solid choice for NLP programming.

# Common NLP Techniques

- **Tokenization** - Segmenting text into words, punctuation marks, etc.
- **Part-of-speech (POS) Tagging** - Assigning word types to tokens, like verb or noun.
- **Stemming & Lemmatization** - Reducing word variants to a common base form. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.
- **Sentence Boundary Detection (SBD)** - Finding and segmenting text into individual sentences.
- **Named Entity Recognition (NER)** - Labelling named “real-world” objects, like persons, companies or locations.
- **Entity Linking (EL)** - Disambiguating textual entities to unique identifiers in a knowledge base.
- **Similarity** - Comparing words, text spans and documents to measure how similar they are to each other.
- **Text Classification** - Assigning categories or labels to a whole document, or parts of a document.
- **Rule-based Matching** - Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.


- **Stop words** are words that are filtered out before or after text processing. They are usually the most common words in a language, such as "the" or "is".
- A **regular expression** is a sequence of characters that defines a search pattern.
- The **bag-of-words** model is a simple feature extraction technique that describes the occurrence of each word within a document.
- **TF-IDF** (term frequency-inverse document frequency) is a statistical measure of a word's **importance** to a document, relative to a collection of documents.
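
As a rough illustration of how stop words, regular expressions, and TF-IDF fit together, here is a minimal sketch using only the Python standard library. The stop-word list, the `tokenize` and `tf_idf` helper names, and the sample sentences are all assumptions made for this example. The weighting shown (term frequency times `log(N / document frequency)`) is one common TF-IDF variant; libraries typically apply smoothing and will produce different numbers.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "is", "on", "and", "of"}


def tokenize(text):
    """Lowercase the text, extract runs of letters with a regex, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "The cat chased the dog",
]
scores = tf_idf(documents)
for doc, doc_scores in zip(documents, scores):
    print(doc, {term: round(s, 3) for term, s in doc_scores.items()})
```

In the first sentence, "mat" scores higher than "cat" because "mat" appears in only one of the three documents, while "cat" appears in two.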

## Tokenization
Tokenization segments text into words, punctuation and so on, applying rules specific to each language.

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, end=" | ")

Apple | is | looking | at | buying | U.K. | startup | for | $ | 1 | billion | 
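
To get a feel for what "rules" means here, the sketch below contrasts a naive whitespace split with a small regular-expression tokenizer. The pattern and the `simple_tokenize` name are assumptions written for this one sentence. spaCy's real tokenizer is far more thorough and relies on language-specific exception lists.

```python
import re

# Alternatives are tried left to right:
#   1. dotted abbreviations such as "U.K." (two or more letter-dot pairs)
#   2. runs of word characters (letters, digits, underscore)
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, abbreviation, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = "Apple is looking at buying U.K. startup for $1 billion"
print(sentence.split())           # whitespace split keeps "$1" as one token
print(simple_tokenize(sentence))  # regex split separates "$" from "1"
```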

## Stemming & Parts of Speech
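Part-of-speech (POS) tagging assigns word types, such as verb or noun, to tokens. spaCy also exposes each token's lemma, syntactic dependency, shape, and flags such as `is_alpha` and `is_stop`.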

In [14]:
print("{0:8} {1:8} {2:8} {3:8} {4:8} {5:8} {6:4} {7:4}".format("TEXT","LEMMA","POS","TAG","DEP","SHAPE","ALPHA","STOP"))
print("--" * 32)
for token in doc:
    print("{0:8} {1:8} {2:8} {3:8} {4:8} {5:8} {6:4} {7:4} ".format(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop))

TEXT     LEMMA    POS      TAG      DEP      SHAPE    ALPHA STOP
----------------------------------------------------------------
Apple    Apple    PROPN    NNP      nsubj    Xxxxx       1    0 
is       be       AUX      VBZ      aux      xx          1    1 
looking  look     VERB     VBG      ROOT     xxxx        1    0 
at       at       ADP      IN       prep     xx          1    1 
buying   buy      VERB     VBG      pcomp    xxxx        1    0 
U.K.     U.K.     PROPN    NNP      compound X.X.        0    0 
startup  startup  NOUN     NN       dobj     xxxx        1    0 
for      for      ADP      IN       prep     xxx         1    1 
$        $        SYM      $        quantmod $           0    0 
1        1        NUM      CD       compound d           0    0 
billion  billion  NUM      CD       pobj     xxxx        1    0 


### Stemming & Lemmatization

Stemming and lemmatization are two related but distinct forms of text **normalization**.

- **Stemming** is a crude heuristic process that chops off the ends of words in the hope of achieving a correct *base form*.

- **Lemmatization** uses vocabulary and morphological analysis of words to remove grammatical endings only and to return the base or dictionary form of a word (the **lemma**).

A stemmer operates without knowledge of context, so it cannot distinguish between words whose meaning depends on their part of speech. In exchange, stemmers are easier to implement and usually run faster than lemmatizers.
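
The difference can be sketched with two toy functions. Both are assumptions for illustration only: `crude_stem` strips a few hard-coded suffixes, and `toy_lemmatize` looks words up in a tiny hand-written table. Real stemmers and lemmatizers (for example, those in NLTK or spaCy) use far more careful rules and vocabularies.

```python
# Suffixes checked in order; longer ones come first so "es" wins over "s".
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Chop off the first matching suffix, keeping at least two characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-dictionary mapping inflected forms to their lemmas.
LEMMA_TABLE = {
    "was": "be",
    "is": "be",
    "rats": "rat",
    "looking": "look",
    "buying": "buy",
}


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    return LEMMA_TABLE.get(word.lower(), word.lower())


for w in ["rats", "looking", "was", "studies"]:
    print(w, "stem:", crude_stem(w), "lemma:", toy_lemmatize(w))
```

The stemmer turns "was" into "wa" and "studies" into "studi", neither of which is a real word, while the lookup returns "be" for "was" but cannot handle any word missing from its table.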

## Named Entity Recognition
A **named entity** is a “real-world object” with a name – for example, a person, a country, a product or a book title. Entity recognition may use a **corpus** of known entities or a statistical model.

In [22]:
print("{0:10} {1:5} {2:5} {3:5}".format("TEXT","START","END","LABEL"))
print("--" * 15)

for ent in doc.ents:
    print("{0:10} {1:5} {2:5} {3:5} ".format(ent.text, ent.start_char, ent.end_char, ent.label_))

TEXT       START END   LABEL
------------------------------
Apple          0     5 ORG   
U.K.          27    31 GPE   
$1 billion    44    54 MONEY 
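
For contrast with the statistical model used above, the following sketch shows the lookup-based approach: a small list of known entities plus one regular expression for money amounts. The `GAZETTEER` contents, the `MONEY_PATTERN`, and the `find_entities` name are assumptions for this example, not part of spaCy.

```python
import re

# Hypothetical list of known entities and their labels.
GAZETTEER = {"Apple": "ORG", "U.K.": "GPE"}

# A dollar sign, digits, an optional decimal part, and an optional magnitude word.
MONEY_PATTERN = re.compile(r"\$\d+(?:\.\d+)?(?: (?:thousand|million|billion))?")


def find_entities(text):
    """Return (text, start_char, end_char, label) tuples sorted by position."""
    entities = []
    for name, label in GAZETTEER.items():
        # re.escape treats the dots in "U.K." as literal characters.
        for match in re.finditer(re.escape(name), text):
            entities.append((match.group(), match.start(), match.end(), label))
    for match in MONEY_PATTERN.finditer(text):
        entities.append((match.group(), match.start(), match.end(), "MONEY"))
    return sorted(entities, key=lambda entity: entity[1])


text = "Apple is looking at buying U.K. startup for $1 billion"
for entity in find_entities(text):
    print(entity)
```

A pure lookup like this would also label "Apple" as an organization in a sentence about fruit, which is one reason statistical models that consider context are widely used.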


## Word Vectors & Similarity

Machine learning algorithms cannot work with raw text directly, so we convert the text into vectors of numbers (`word vectors` or `word embeddings`). 

Word vectors represent each word in a text numerically such that the vector corresponds to how that word is used or what it means. This allows algorithms to determine similarity of text.

One common approach is the `bag-of-words` method, which represents a document by the words that occur in it, ignoring word order. The simplest scoring scheme marks each vocabulary word with 1 if it is present in the document and 0 if it is absent.
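
The presence/absence scoring described above can be sketched in a few lines. The helper names and the two sample documents are assumptions for illustration, and splitting on whitespace stands in for proper tokenization.

```python
def build_vocabulary(documents):
    """Collect every distinct lowercased word, sorted for a stable column order."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def binary_bag_of_words(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


documents = ["the cat sat", "the dog sat down"]
vocabulary = build_vocabulary(documents)
print(vocabulary)
for doc in documents:
    print(doc, "->", binary_bag_of_words(doc, vocabulary))
```

Each document becomes a vector with one position per vocabulary word, so documents of different lengths end up with vectors of the same size.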

Other, more complex algorithms, such as `word2vec`, derive vectors that take account of a word's context. Words that appear in similar contexts will have similar vectors. Relations between words can then be examined with vector and matrix operations.
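
Training `word2vec` is beyond a short example, so the sketch below swaps in a much simpler technique: it counts which words appear next to each word, then compares those count vectors with cosine similarity. It is not word2vec, but it illustrates the same idea that words used in similar contexts get similar vectors. The corpus and function names are invented for this example.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=1):
    """Map each word to a Counter of the words found within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy corpus; "cat" and "dog" share neighbours, "stocks" does not.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "stocks fell on monday",
]
vectors = cooccurrence_vectors(sentences)
print("cat vs dog:   ", cosine(vectors["cat"], vectors["dog"]))
print("cat vs stocks:", cosine(vectors["cat"], vectors["stocks"]))
```

In this corpus "cat" and "dog" have the same neighbouring words, so their similarity is much higher than that of "cat" and "stocks", which share none.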

