## Natural Language Processing Pipelines

In this lesson, you'll be introduced to some of the steps involved in an NLP pipeline:

1. Text Processing - Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
   - Cleaning
   - Normalization
   - Tokenization
   - Stop Word Removal
   - Part of Speech Tagging
   - Named Entity Recognition
   - Stemming and Lemmatization
2. Feature Extraction - Extract and produce feature representations that are appropriate for the type of NLP task you are trying to accomplish and the type of model you are planning to use.
   - Bag of Words
   - TF-IDF
   - Word Embeddings
3. Modeling - Design a statistical or machine learning model, fit its parameters to training data using an optimization procedure, and then use it to make predictions about unseen data.
    
This process isn't always linear and may require additional steps.

Udacity's prescribed text processing steps:
1. normalize text and remove punctuation
2. split and tokenize
3. remove stopwords (reduce vocab)
4. lemmatize first then stem
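
The steps above can be sketched with nothing but the standard library. This is a minimal illustration, not the course's implementation: `STOP_WORDS` is a hypothetical, deliberately tiny list, and `toy_stem` is a crude suffix stripper standing in for a real lemmatizer and stemmer (a library such as NLTK or spaCy would normally supply all three).

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "on"}

def normalize(text):
    """Lowercase the text and replace every run of non-alphanumerics with a space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower())

def tokenize(text):
    """Split normalized text on whitespace."""
    return text.split()

def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own (reduces the vocabulary)."""
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(token):
    """Crude suffix stripping; an assumption standing in for lemmatize-then-stem."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run the four steps in order: normalize, tokenize, remove stop words, stem."""
    tokens = remove_stop_words(tokenize(normalize(text)))
    return [toy_stem(t) for t in tokens]

print(preprocess("The dogs are barking in the yard!"))  # ['dog', 'bark', 'yard']
```

Keeping each step as its own function makes it easy to reorder or swap steps, which matters because the process isn't always linear.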
    
This is my 2 cents:
1. use NER (named entity recognition), including GPE (geo-political entity) labels
2. normalize text and remove punctuation (possibly also numbers, URLs, etc.)
3. split and tokenize
4. remove stopwords
5. lemmatize first then stem
6. phrase modeling

Depending on what you want to do with NLP, use a different feature extraction technique:
1. For spam detection or sentiment analysis, which need a per-document representation: bag-of-words or doc2vec.
2. For tasks that work with individual words, such as text generation or machine translation, use a word-level representation: word2vec or GloVe.

#### Bag of Words (treats each word as equally important)
- unordered collection of words, hence "bag"
- first apply the text processing steps: normalization, stop word removal, lemmatization, stemming, etc.
- treat the resulting tokens as an unordered set
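
As a small sketch of the idea, `collections.Counter` behaves like a bag: it ignores word order but remembers how often each token occurs, which is what the document-term matrix needs. The helper name `bag_of_words` and the sample sentences are illustrative assumptions.

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each token to its count; word order is discarded."""
    return Counter(tokens)

doc_a = "little house on the prairie".split()
doc_b = "prairie the on house little".split()  # same words, different order

# Both orderings produce the same bag.
print(bag_of_words(doc_a) == bag_of_words(doc_b))  # True
```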

Document Term Matrix
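- it is more efficient to turn each doc into a vector of numbers, where each number is how many times a term appears in the document
- use the dot product to compare two documents: a larger value means more overlapping terms
- use cosine similarity, the dot product divided by the product of the two vector magnitudes, so that document length does not dominate the score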
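
A minimal sketch of these three ideas in plain Python follows. The function names and the three sample documents are assumptions made for illustration; in practice a library such as scikit-learn would usually build the matrix.

```python
import math
from collections import Counter

def document_term_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    vocabulary = sorted({token for doc in docs for token in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

def dot(u, v):
    """Dot product: grows with the number of shared terms and their counts."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector magnitudes."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

docs = [
    "little house on the prairie".split(),
    "mary had a little lamb".split(),
    "the silence of the lambs".split(),
]
vocabulary, dtm = document_term_matrix(docs)

print(dot(dtm[0], dtm[1]))                # 1 (only "little" is shared)
print(dot(dtm[0], dtm[2]))                # 2 ("the" appears once and twice)
print(cosine_similarity(dtm[0], dtm[2]))  # 2 / sqrt(35), roughly 0.338
```

The raw dot product rewards the third document for repeating "the", which is why the length-normalized cosine score is usually the better comparison.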

#### TF-IDF (proportional to frequency in a document, but inversely proportional to the number of documents the word appears in, in an attempt to highlight uniqueness)
- term frequency x inverse document frequency
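
The product can be sketched as below. The exact weighting is an assumption: this uses one common variant, where term frequency is the count divided by the document length and inverse document frequency is the natural log of (number of documents / number of documents containing the term). Libraries often apply different smoothing, so their numbers will not match this sketch exactly.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document (unsmoothed variant, see above)."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "little house on the prairie".split(),
    "mary had a little lamb".split(),
    "the silence of the lambs".split(),
]
weights = tf_idf(docs)

# "prairie" occurs in one document, "the" in two, so "prairie" scores higher.
print(weights[0]["prairie"] > weights[0]["the"])  # True
```

With this variant, a term that appears in every document gets a weight of zero, which is the "highlight uniqueness" effect taken to its limit.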

Look up applications for: word2vec, GloVe (Global Vectors for Word Representation), t-SNE, and embeddings for deep learning.