## Don't worry if you don't understand everything at first! You're not supposed to. We will start using some "black boxes" and then we'll dig into the lower level details later.

## To start, focus on what things DO, not what they ARE.

# What is NLP?
    
Natural Language Processing (NLP) is a set of techniques by which computers try to understand human language and derive meaning from it.
    

NLP is a broad field, encompassing a variety of tasks, including:

1.  Part-of-speech tagging (identify whether each word is a noun, verb, adjective, etc.)
2.  Named entity recognition (NER) (identify person names, organizations, locations, medical codes, time expressions, quantities, monetary values, etc.)
3.  Question answering
4.  Speech recognition
5.  Text-to-speech and Speech-to-text
6.  Topic modeling
7.  Sentiment classification
8.  Language modeling
9.  Translation



# What is NLU?
 
Natural Language Understanding (NLU) is the part of NLP concerned with working out the meaning of natural-language input, rather than only processing its surface form.

Goals of NLU

1. Gain insights into cognition
2. Develop artificially intelligent agents that act as assistants

# What is NLG?

Natural language generation is the natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form. 

Example applications of NLG

1. Recommendation and comparison
2. Report generation and summarization
3. Paraphrasing
4. Prompt and response generation in dialogue systems
    
    

# Packages

1. [Flair](https://github.com/zalandoresearch/flair)
2. [AllenNLP](https://github.com/allenai/allennlp)
3. [DeepPavlov](https://github.com/deepmipt/deeppavlov)
4. [PyText](https://github.com/facebookresearch/PyText)
5. [NLTK](https://www.nltk.org/)
6. [Hugging Face PyTorch Transformers](https://github.com/huggingface/pytorch-transformers)
7. [spaCy](https://spacy.io/)
8. [torchtext](https://torchtext.readthedocs.io/en/latest/)
9. [Ekphrasis](https://github.com/cbaziotis/ekphrasis)
10. [Gensim](https://radimrehurek.com/gensim/)

# NLP Pipeline

## Data Collection

### Sources

For Generative Training :- The model has to learn the data and its distribution.

1. News articles :- Archives
2. Wikipedia articles
3. Book Corpus
4. Crawling the Internet for webpages
5. Reddit

Generative training on a large amount of unlabelled data enables transfer learning: for a downstream task, only a few parameters need to be learnt from scratch, and less labelled data is required.

For Discriminative Training :- The model learns the decision boundary within the data. Example datasets, among others:

1. Generic
    1. Kaggle datasets
2. Sentiment
    1. Product reviews :- Amazon, Flipkart
3. Emotion
    1. ISEAR
    2. Twitter dataset
4. Question Answering
    1. SQuAD
    
### For Vernacular text
For vernacular text, data is scarce, especially for state-specific languages in India (e.g. Bengali, Gujarati).
A few sources are:-
1. News (Jagran.com, Dainik Bhaskar)
2. Movie reviews (Web Duniya)
3. Hindi Wikipedia
4. Book Corpus
5. IIT Bombay (English-Hindi Parallel Corpus)

### Tools
1. Scrapy :- A simple, extensible framework for scraping and crawling websites, with many built-in features.
2. Beautiful Soup :- For parsing HTML and XML documents.
3. Excel 
4. wikiextractor:- A tool for extracting plain text from Wikipedia dumps

### Data Annotation Tools

1. TagTog
2. Prodigy (Explosion AI)
3. Mechanical Turk 
4. PyBossa
5. Chakki-works Doccano   
6. WebAnno
7. Brat

## Data Preprocessing

1.  Cleaning
2.  Regex 
    1. Url Cleanup
    2. HTML Tag
    3. Date
    4. Numbers
    5. Lingos
    6. Emoticons 
3.  Lemmatization 
4.  Stemming
5.  Chunking
6.  POS Tags
7.  NER Tags
8.  Stopwords
9.  Tokenizers
10. Spell Correction
11. Word Segmentation
12. Word Processing 
    1. Elongated
    2. Repeated
    3. All Caps
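
Several of the steps listed above (HTML tag and URL cleanup, number normalization, elongated words, tokenization, stopword removal) can be sketched with the standard `re` module. This is only an illustration under simplifying assumptions: the regular expressions, the `<url>` and `<number>` placeholder tokens, the tiny `STOPWORDS` set, and the function names are all made up for this example and are not taken from any particular library.

```python
import re

HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")
ELONGATED_RE = re.compile(r"([A-Za-z])\1{2,}")  # a letter repeated 3+ times
TOKEN_RE = re.compile(r"<\w+>|\w+(?:'\w+)?|[^\w\s]")
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of"}  # illustrative only


def clean_text(text):
    """Strip HTML tags, replace URLs and numbers, shorten elongated words."""
    text = HTML_TAG_RE.sub(" ", text)         # drop tags first
    text = URL_RE.sub(" <url> ", text)        # then insert placeholders
    text = NUMBER_RE.sub(" <number> ", text)
    text = ELONGATED_RE.sub(r"\1\1", text)    # "sooooo" becomes "soo"
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text):
    """Split into placeholders, words, and single punctuation marks."""
    return TOKEN_RE.findall(text)


def preprocess(text):
    """Clean, tokenize, and remove stopwords."""
    return [tok for tok in tokenize(clean_text(text)) if tok not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("<p>The movie is sooooo good, see https://example.com</p>"))
```

Tags are stripped before the placeholders are inserted so that `<url>` and `<number>` are not themselves removed as tags. Stemming, lemmatization, POS tagging, and NER are usually delegated to a library such as NLTK or spaCy rather than written by hand.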

### Feature Selection

1. Bag of Words
![](https://uc-r.github.io/public/images/analytics/feature-engineering/bow-image.png)
2. TF-IDF
![](https://miro.medium.com/max/3604/1*ImQJjYGLq2GE4eX40Mh28Q.png)
3. Word Embeddings
    1. Word2Vec
    
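    Word2Vec is a predictive model: it learns word vectors by training a shallow network to predict a word from its context (CBOW) or the context from a word (skip-gram).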
    ![](https://skymind.ai/images/wiki/word2vec_diagrams.png)
    2. GloVe
    
    GloVe is a count-based model: it learns its vectors by essentially performing dimensionality reduction on the word co-occurrence count matrix.
    3. FastText
        
        fastText is trained in much the same way as Word2Vec; the main difference is that fastText enriches the word vectors with subword units (character n-grams).
        
        [How fastText works](https://www.quora.com/What-is-the-main-difference-between-word2vec-and-fastText)
        
    4. ELMo
            
           ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.
           
        ELMo representations are:

        * Contextual: The representation for each word depends on the entire context in which it is used.
        * Deep: The word representations combine all layers of a deep pre-trained neural network.
        * Character based: ELMo representations are purely character based, allowing the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training. 
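
The first two representations in this section, Bag of Words and TF-IDF, are simple enough to sketch in plain Python. The version below assumes already-tokenized documents and uses the unsmoothed weighting tf × log(N / df); libraries such as scikit-learn apply smoothed and normalized variants, so their numbers will differ. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, count in Counter(doc).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def tf_idf(docs):
    """Return (vocabulary, TF-IDF vectors) using tf * log(N / df)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    weights = []
    for vec in counts:
        total = sum(vec) or 1  # avoid division by zero for empty documents
        weights.append([(c / total) * w for c, w in zip(vec, idf)])
    return vocab, weights


if __name__ == "__main__":
    docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"]]
    print(bag_of_words(docs))
    print(tf_idf(docs))
```

Terms that occur in every document get an IDF of zero here, which is the intended effect: they carry no information for telling the documents apart.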

### Modelling

1.  RNN
![](https://proxy.duckduckgo.com/iu/?u=http%3A%2F%2Fcorochann.com%2Fwp-content%2Fuploads%2F2017%2F05%2Frnn1_expand.png&f=1&nofb=1)

RNNs suffer from the vanishing gradient problem, so they struggle to retain long-term dependencies.
2.  LSTM

Long Short Term Memory networks – usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies. 

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn!

![](https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1*6vw1g-HNuOgRYPj-IGhddQ.png&f=1&nofb=1)



3.  BI-LSTM
![](https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn-images-1.medium.com%2Ffreeze%2Fmax%2F1000%2F1*QBrVVvYps5zo6QtBRRq4fA.png%3Fq%3D20&f=1&nofb=1)

4.  GRU

5.  CNNs
6.  Seq-Seq
![](https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1*_6-EVV3RJXD5KDjdnxztzg%402x.png&f=1&nofb=1)

7.  Seq-Seq Attention
![](https://pravn.files.wordpress.com/2017/11/luong.png?w=319)
8.  Pointer Generator Network
![](https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Ftse1.mm.bing.net%2Fth%3Fid%3DOIP.c6kke1e2bWMaicGFw7wTwwHaEM%26pid%3DApi&f=1)
9.  Transformer
![](https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Fraw.githubusercontent.com%2FDongjunLee%2Ftransformer-tensorflow%2Fmaster%2Fimages%2Ftransformer-architecture.png&f=1&nofb=1)
![](https://1.bp.blogspot.com/-AVGK0ApREtk/WaiAuzddKVI/AAAAAAAAB_A/WPV5ropBU-cxrcMpqJBFHg73K9NX4vywwCLcBGAs/s1600/image2.png)
10. GPT
![](https://miro.medium.com/max/1772/1*MXspASIUulGBw58PyMA5Ig.png)
11. Transformer-XL
![](https://proxy.duckduckgo.com/iu/?u=https%3A%2F%2Fwww.lyrn.ai%2Fwp-content%2Fuploads%2F2019%2F01%2FTransformerXL-featured.png&f=1&nofb=1)
12. BERT

BERT’s key technical innovation is applying bidirectional training of the Transformer, a popular attention model, to language modelling.

BERT is given billions of sentences at training time. It’s then asked to predict a random selection of missing words from these sentences. After practicing with this corpus of text several times over, BERT acquires a pretty good understanding of how a sentence fits together grammatically. It’s also better at predicting ideas that are likely to show up together.

![](https://blog.fastforwardlabs.com/images/2018/12/Screen_Shot_2018_12_07_at_12_03_44_PM-1544202300577.png)
![](https://jalammar.github.io/images/bert-tasks.png)
13. GPT-2
![](https://miro.medium.com/max/1742/1*wUOgqwOJv-eMd0rSjWlTMg.png)
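
The "predict a random selection of missing words" objective described for BERT above can be illustrated by how the training inputs are prepared. The sketch below is a simplification: it hides each token with a fixed probability and always substitutes a `[MASK]` token, whereas the original BERT recipe selects about 15% of tokens and sometimes substitutes a random token or keeps the original. The function name `mask_tokens` and its parameters are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for masked language modelling.

    labels[i] holds the original token where position i was masked,
    and None elsewhere (positions the model is not asked to predict).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK_TOKEN
            labels[i] = tok
    return masked, labels


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=0)
    print(masked)
    print(labels)
```

During training, the model sees `masked` as input and is scored only on the positions where `labels` is not `None`. Because it can look at words on both sides of each gap, the learned representation is bidirectional.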

## Business Problems

1.  Text Classification
     1. Sentiment Classification
     2. Emotion Classification
     3. Reviews Rating
2.  Topic Modeling
3.  Named Entity Recognition
4.  Part Of Speech Tagging
5.  Language Model
6.  Machine Translation
7.  Question Answering
8.  Text Summarization
9.  Text Generation
10. Image Captioning
11. Optical Character Recognition
12. Chatbots
13. [Dependency Parsing](https://nlpprogress.com/english/dependency_parsing.html)
14. [Coreference Resolution](https://en.wikipedia.org/wiki/Coreference) 
15. [Semantic Textual Similarity](https://nlpprogress.com/english/semantic_textual_similarity.html)