# Natural Language Processing
## Building an NLP Pipeline
- Text processing
- Feature extraction
- Modelling

### Text processing
- input formats: HTML, PDF, documents, speech, books, etc.
- output format: clean plain text (after steps such as stemming, punctuation removal, and stop-word removal)
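
As a rough illustration of the cleaning steps listed above, the sketch below strips HTML tags, lowercases the text, removes punctuation, and filters stop words. The function name `clean_text` and the tiny `STOP_WORDS` set are assumptions made for this example; real stop-word lists are much longer, and stemming is left out to keep the sketch dependency-free.

```python
import re
import string

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(raw):
    """Turn raw (possibly HTML) text into a list of normalized tokens."""
    text = re.sub(r"<[^>]+>", " ", raw)      # crude HTML tag removal
    text = text.lower()                      # case normalization
    text = text.translate(PUNCT_TO_SPACE)    # punctuation removal
    tokens = text.split()                    # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("<p>The first step is to clean the text!</p>"))
# ['first', 'step', 'clean', 'text']
```

A regex is only a rough way to strip markup; for real HTML, a proper parser is usually the safer choice.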

### Feature extraction
- depends on what kind of model you're using and what task you're trying to accomplish.
  - if you want to use a graph-based model to extract insights, you may want to represent your words as symbolic nodes with relationships between them, as in **WordNet**.
  - for statistical models, you need some sort of numerical representation.
- think about the end goal
  - if you're trying to perform a document-level task, such as spam detection or sentiment analysis, you may want to use a per-document representation such as **bag-of-words** or **doc2vec**.
  - if you want to work with individual words and phrases, such as for text generation or machine translation, you'll need a word-level representation such as **word2vec** or **GloVe**.
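
To make the per-document representation concrete, here is a minimal bag-of-words sketch using only the standard library. The helper name `bag_of_words` and the three example documents are made up for illustration; in practice a library vectorizer would normally be used instead.

```python
from collections import Counter

def bag_of_words(docs):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One entry per vocabulary word, in vocabulary order.
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["free money now", "meeting at noon", "free free offer"])
print(vocab)    # ['at', 'free', 'meeting', 'money', 'noon', 'now', 'offer']
print(vectors)  # [[0, 1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 1, 0, 0], [0, 2, 0, 0, 0, 0, 1]]
```

Each document becomes a fixed-length vector of word counts, so word order is discarded. That is often acceptable for document-level tasks but not for word-level ones.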

### Modelling
- designing a model, typically a statistical or machine learning model
- fitting its parameters to training data using an optimization procedure
- using it to make predictions about unseen data
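
The three steps above can be sketched with a tiny logistic regression classifier written in plain Python. Everything here is an assumption for illustration: the toy features (counts of the words "free" and "meeting"), the labels, the learning rate, and the function names. It is a sketch of the design/fit/predict loop, not a recommended implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    """Linear score w . x + b (the model design: logistic regression)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def fit(X, y, lr=0.5, epochs=200):
    """Fit weights w and bias b by stochastic gradient descent on the log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sigmoid(score(w, b, xi)) - yi   # gradient of the log loss w.r.t. the score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(score(w, b, x)) >= 0.5 else 0

# Toy data (made up): features are [count of "free", count of "meeting"]; label 1 = spam.
X_train = [[1, 0], [2, 0], [0, 1], [0, 2]]
y_train = [1, 1, 0, 0]

w, b = fit(X_train, y_train)
print(predict(w, b, [3, 0]), predict(w, b, [0, 3]))  # predictions on unseen inputs
```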

### Part-of-Speech Tagging
- Note: **part-of-speech** tagging with a predefined grammar is a limited solution. It can be very tedious and error-prone for a large corpus of text, since you have to account for all possible sentence structures and tags!
- There are other, more advanced forms of POS tagging that can learn sentence structures and tags from given data, including **Hidden Markov Models (HMMs)** and **Recurrent Neural Networks (RNNs)**.


In [6]:
import nltk 
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, word_tokenize
# Tag parts of speech (PoS)
sentence = word_tokenize("I always lie down to tell a lie")
pos_tag(sentence)


[('I', 'PRP'),
 ('always', 'RB'),
 ('lie', 'VBP'),
 ('down', 'RP'),
 ('to', 'TO'),
 ('tell', 'VB'),
 ('a', 'DT'),
 ('lie', 'NN')]
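
The output above gives the two occurrences of "lie" different tags (`VBP` and `NN`), which requires looking at context. For contrast, here is a minimal sketch of a most-frequent-tag baseline that learns one tag per word from tagged data and ignores context. It is deliberately simpler than an HMM or RNN tagger. The training sentences, function names, and the default tag `NN` for unknown words are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for each word from tagged sentences."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def tag(words, model, default="NN"):
    """Tag each word independently; unknown words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Tiny made-up training set: "lie" appears twice as a verb and once as a noun.
training = [
    [("I", "PRP"), ("lie", "VBP"), ("down", "RP")],
    [("they", "PRP"), ("lie", "VBP")],
    [("a", "DT"), ("lie", "NN")],
]

model = train_unigram_tagger(training)
print(tag(["tell", "a", "lie"], model))
# [('tell', 'NN'), ('a', 'DT'), ('lie', 'VBP')]
```

The baseline tags "lie" as a verb even after "a", because it only knows the word's most common tag. Sequence models such as HMMs address this by also modelling which tags tend to follow one another.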

### Named Entity Recognition

In [8]:
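import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')  # ne_chunk also needs the 'words' corpus
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
# Recognize named entities in a tagged sentence (the chunker requires NumPy to be installed)
ne_chunk(pos_tag(word_tokenize("Antonio joined Udacity Inc. in California.")))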


ModuleNotFoundError: No module named 'numpy'

In [None]:
! conda install -y numpy
