# Natural Language Processing
## Building an NLP Pipeline
- Text processing
- Feature extraction
- Modelling

### Text processing
- input formats: HTML, PDF, Word documents, voice recordings, books, etc.
- output format: plain text that has been cleaned and normalized (punctuation removed, stop words removed, words stemmed, etc.)
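
A minimal sketch of these steps for an HTML snippet, using only the standard library. The stop-word list and the helper names (`strip_html`, `normalize`, `process`) are assumptions made up for illustration; a real pipeline would more likely use an HTML parser and a library-provided stop-word list.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}

def strip_html(raw):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", raw)

def normalize(text):
    """Lowercase the text and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

def tokenize(text):
    """Split on whitespace."""
    return text.split()

def remove_stop_words(tokens):
    """Drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def process(raw):
    """Chain the steps: clean, normalize, tokenize, remove stop words."""
    return remove_stop_words(tokenize(normalize(strip_html(raw))))

tokens = process("<p>The first step is <b>cleaning</b> the text!</p>")
print(tokens)  # ['first', 'step', 'cleaning', 'text']
```

Stemming is left out here; it would be one more function at the end of the chain.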

### Feature extraction
- depends on what kind of model you're using and what task you're trying to accomplish.
  - if you want to use a graph-based model to extract insights, you may want to represent your words as symbolic nodes with relationships between them, as in **WordNet**.
  - for statistical models, you need some sort of numerical representation.
- think about the end goal
  - if you're trying to perform a document-level task, such as spam detection or sentiment analysis, you may want to use a per-document representation such as **bag-of-words** or **doc2vec**.
  - if you want to work with individual words and phrases, such as for text generation or machine translation, you'll need a word-level representation such as **word2vec** or **GloVe**.

### Modelling
- designing a model: statistical or machine learning model
- fitting its parameters to training data using an optimization procedure
- using it to make predictions about unseen data
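
The three steps can be sketched with a tiny spam classifier. This is only an illustration under stated assumptions: the four training texts are made up, the model is a hand-written logistic regression over bag-of-words counts, and the optimization procedure is plain stochastic gradient descent with an arbitrarily chosen learning rate and epoch count.

```python
import math

# Toy labelled data (made up for illustration): 1 = spam, 0 = not spam.
train = [
    ("win money now", 1),
    ("free money offer", 1),
    ("meeting at noon", 0),
    ("lunch meeting tomorrow", 0),
]

vocab = sorted({w for text, _ in train for w in text.split()})

def featurize(text):
    """Bag-of-words count vector over the training vocabulary."""
    words = text.split()
    return [words.count(w) for w in vocab]

def predict_proba(weights, bias, x):
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

# 1. Design the model: one weight per vocabulary word plus a bias.
weights, bias = [0.0] * len(vocab), 0.0

# 2. Fit the parameters: stochastic gradient descent on the log loss.
learning_rate = 0.5
for _ in range(200):
    for text, label in train:
        x = featurize(text)
        error = predict_proba(weights, bias, x) - label
        weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
        bias -= learning_rate * error

# 3. Predict on text the model has not seen.
p_spam = predict_proba(weights, bias, featurize("free money"))
p_ham = predict_proba(weights, bias, featurize("meeting tomorrow"))
print(round(p_spam, 2), round(p_ham, 2))
```

With data this small the numbers mean little; the point is only the design, fit, predict sequence.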

### Part-of-Speech Tagging
- Note: **part-of-speech** (POS) tagging using a predefined grammar is a simple but limited solution. It can be very tedious and error-prone for a large corpus of text, since you have to account for all possible sentence structures and tags!
- There are other, more advanced forms of POS tagging that can learn sentence structures and tags from given data, including **Hidden Markov Models (HMMs)** and **Recurrent Neural Networks (RNNs)**.
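
To give a feel for how an HMM tagger picks tags, here is a small Viterbi decoding sketch. All the probabilities below are invented by hand for illustration; in a real tagger they would be estimated from a tagged corpus. The tag set, the `floor` value for unseen words, and the function name `viterbi` are likewise assumptions.

```python
# Hand-made HMM parameters (assumptions, not learned from data).
tags = ["NOUN", "VERB", "DET"]
start_p = {"NOUN": 0.3, "VERB": 0.5, "DET": 0.2}
trans_p = {
    "NOUN": {"NOUN": 0.2, "VERB": 0.7, "DET": 0.1},
    "VERB": {"NOUN": 0.3, "VERB": 0.1, "DET": 0.6},
    "DET": {"NOUN": 0.9, "VERB": 0.05, "DET": 0.05},
}
emit_p = {
    "NOUN": {"lie": 0.4},
    "VERB": {"tell": 0.4, "lie": 0.5},
    "DET": {"a": 0.9},
}

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for words under the HMM."""
    # best[i][tag] = (probability of the best path ending in tag at i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], floor), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            # Pick the previous tag that maximizes path probability into t.
            prev, score = max(
                ((p, best[-1][p][0] * trans_p[p][t]) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score * emit_p[t].get(word, floor), prev)
        best.append(column)
    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

path = viterbi("tell a lie".split(), tags, start_p, trans_p, emit_p)
print(path)  # ['VERB', 'DET', 'NOUN']
```

In these made-up numbers "lie" is slightly more likely to be emitted by VERB than by NOUN, yet the decoder tags it NOUN because a determiner is almost always followed by a noun. That use of context is what a fixed word-to-tag lookup cannot do.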


In [6]:
import nltk 
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, word_tokenize
# Tag parts of speech (PoS)
sentence = word_tokenize("I always lie down to tell a lie")
pos_tag(sentence)



[('I', 'PRP'),
 ('always', 'RB'),
 ('lie', 'VBP'),
 ('down', 'RP'),
 ('to', 'TO'),
 ('tell', 'VB'),
 ('a', 'DT'),
 ('lie', 'NN')]

### Named Entity Recognition

In [None]:
nltk.download('maxent_ne_chunker')
nltk.download('words')  # word list that the NE chunker needs at run time
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
# Recognize named entities in a tagged sentence
ne_chunk(pos_tag(word_tokenize("Antonio joined Udacity Inc. in California.")))

### Bag of Words 
- To obtain a bag of words from a piece of raw text, simply apply the appropriate text processing steps:
    - cleaning
    - normalizing
    - splitting into words
    - stemming
    - lemmatization
- treat the resulting tokens as an unordered collection, or set

_But keeping these as separate sets is very inefficient: they may have different sizes, may contain different words, and are hard to compare. A word may also appear multiple times, which a set does not capture._

- a more useful approach is to turn each document into a vector of numbers representing how many times each word occurs in that document
- A set of documents is called a corpus, and it provides the context for the vectors to be calculated
    - first: collect all the unique words present in your corpus to form your vocabulary
    - second: arrange these words in some order and let them form the vector element positions, or the columns of a table, with each document as a row
    - third: count the number of occurrences of each word in each document and enter the value in the respective column. At this stage it is easier to think of this as a **Document-Term Matrix**, illustrating the relationship between documents (rows) and words or terms (columns)
- One possibility is to compare documents based on how many words they have in common or how similar their term frequencies are
    - **dot product between the two row vectors**: the sum of the products of corresponding elements. The greater the dot product, the more similar the two vectors are. But it captures only the overlap: it is not affected by the values for words the two documents do not share, so very different pairs can score the same as nearly identical ones
    - **cosine similarity**: divide the dot product of the two vectors by the product of their magnitudes (Euclidean norms). For count vectors, which have no negative entries, the result lies between 0 and 1
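
The three steps and both similarity measures can be sketched in a few lines. The corpus is a made-up toy example and is assumed to be already cleaned and tokenized; the helper names (`to_vector`, `dot`, `cosine_similarity`) are chosen for illustration.

```python
import math
from collections import Counter

# Toy corpus (made up); each document is assumed to be cleaned and tokenized.
corpus = [
    "little house on the prairie".split(),
    "mary had a little lamb".split(),
    "the silence of the lambs".split(),
]

# First: collect the unique words to form the vocabulary.
# Second: fix an order so that each word owns one column.
vocab = sorted({word for doc in corpus for word in doc})

# Third: count occurrences per document; each row is one document.
def to_vector(doc):
    counts = Counter(doc)
    return [counts[word] for word in vocab]

dtm = [to_vector(doc) for doc in corpus]

def dot(u, v):
    """Sum of the products of corresponding elements."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

print(vocab)
for row in dtm:
    print(row)
print(dot(dtm[0], dtm[1]))                # 1 (only "little" is shared)
print(cosine_similarity(dtm[0], dtm[1]))
```

Because no stemming is applied in this sketch, "lamb" and "lambs" end up in separate columns, which is one reason the stemming and lemmatization steps matter.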

### TF-IDF
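- $\text{tfidf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)$, that is, term frequency times inverse document frequency
    - $\text{tf}(t, d) = \dfrac{\text{count}(t, d)}{|d|}$: how often term $t$ occurs in document $d$, relative to the length of $d$
    - $\text{idf}(t, D) = \log \dfrac{|D|}{|\{d \in D : t \in d\}|}$: the log of the number of documents in corpus $D$ divided by the number of documents that contain $t$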
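
A direct translation of the formula, as a sketch. The toy corpus is made up, the natural logarithm is an assumption (the notes do not fix a base), and `idf` assumes the term occurs in at least one document.

```python
import math

# Toy corpus (made up); each document is a list of tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    """count(t, d) / |d|"""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """log(|D| / number of documents containing t); assumes that number is > 0."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("the", corpus[0], corpus))
print(tfidf("cat", corpus[0], corpus))
```

In the first document "the" occurs twice and "cat" once, but "cat" gets the higher score because it appears in fewer documents of the corpus.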

### One-Hot Encoding 
- treat each word like a class
- assign it a vector that has a one in a single predetermined position for that word and zeros everywhere else. This is just like the bag-of-words idea, except that we keep a single word in each bag and build a vector for it
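
A minimal sketch with a made-up four-word vocabulary; the name `one_hot` is chosen for illustration.

```python
# Made-up vocabulary; each word gets one fixed position.
vocab = ["cat", "dog", "lamb", "house"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector with a 1 at the word's position and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

print(one_hot("dog"))  # [0, 1, 0, 0]
```

The vector length equals the vocabulary size, so this representation grows with every new word.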

### Word Embeddings 
- control the size of the word representation by limiting it to a fixed-size vector
    - if two words are similar in meaning, they should be closer to each other than to words that are not
    - if two pairs of words have a similar difference in their meanings, they should be approximately equally separated in the embedded space
    - such a representation can be used for a variety of purposes, like finding synonyms and analogies, identifying concepts around which words are clustered, classifying words as positive, negative, or neutral, etc.
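
The two properties can be illustrated with hand-made 3-dimensional vectors. These numbers are invented for the example and are not real embeddings; learned embeddings such as word2vec typically have far more dimensions, and their axes have no such clean interpretation.

```python
import math

# Hand-made toy vectors (assumption): roughly [royalty, masculinity, fruitiness].
embeddings = {
    "king": [0.9, 0.3, 0.0],
    "queen": [0.9, -0.3, 0.0],
    "man": [0.1, 0.3, 0.0],
    "woman": [0.1, -0.3, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vec, exclude=()):
    """Word whose vector has the highest cosine similarity to vec."""
    candidates = [w for w in embeddings if w not in exclude]
    return max(candidates, key=lambda w: cosine(vec, embeddings[w]))

# Similar words are closer than unrelated ones.
print(cosine(embeddings["king"], embeddings["queen"]))
print(cosine(embeddings["king"], embeddings["apple"]))

# Analogy: king - man + woman lands near queen.
target = [k - m + w for k, m, w in
          zip(embeddings["king"], embeddings["man"], embeddings["woman"])]
answer = most_similar(target, exclude=("king", "man", "woman"))
print(answer)  # queen
```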

### t-SNE (t-Distributed Stochastic Neighbor Embedding)
- a dimensionality reduction technique that maps high-dimensional vectors to a lower-dimensional space
- when applying the transformation, it tries to maintain relative distances between objects, so that similar ones stay close together while dissimilar ones stay further apart. This makes it a good tool for visualizing word embeddings

## Voice user interfaces 
- Challenges
    - Variability 
        - Pitch 
        - Volume 
        - Speed
    - Ambiguity 
        - Word Boundaries 
        - Spelling 
        - Context 

### Language models 
- Deep Neural Networks (**DNNs**) as speech models
- speech --> features --> acoustic model --> phonemes --> words --> language model --> text
    - speech --> features:
        - MFCC: extracts relevant patterns
        - CNN: also finds relevant patterns
    - features --> words:
        - HMMs: time series data and sequencing
        - RNN: also handles time series
        - CTC: sequencing
    - words --> text:
        - N-grams: can still be used
        - NLM: Neural Language Model
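
To make the N-gram option concrete, here is a sketch of a bigram language model that scores candidate word sequences. The training sentences are made up, the probabilities are plain maximum-likelihood estimates with no smoothing (so any unseen bigram gives probability 0), and the `<s>` and `</s>` boundary markers are a common convention assumed here.

```python
from collections import Counter, defaultdict

# Toy training sentences (made up for illustration).
sentences = [
    "i want to recognize speech",
    "i want to wreck a nice beach",
    "we want to recognize speech",
]

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for sentence in sentences:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(words, words[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """P(word | prev) as a relative frequency; 0.0 if prev was never seen."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

def sentence_prob(sentence):
    """Product of the bigram probabilities along the sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob("i want to recognize speech"))
print(sentence_prob("i want to wreck a nice beach"))
```

When the acoustic model finds two transcriptions about equally plausible, a score like this lets the pipeline prefer the word sequence that is more likely as text.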