# Deep Learning for Natural Language Processing

***These are my notes on the 7-day mini-course created by Jason Brownlee. <br>*** *They may include some classmates' insights.*

This crash course is broken down into 7 lessons that will get you started and productive with deep learning for natural language processing in Python:

* Lesson 01: [Deep Learning and Natural Language](#01)
* Lesson 02: [Cleaning Text Data](#02)
* Lesson 03: [Bag-of-Words Model](#03)
* Lesson 04: [Word Embedding Representation](#04)
* Lesson 05: [Learned Embedding](#05)
* Lesson 06: [Classifying Text](#06)
* Lesson 07: [Movie Review Sentiment Analysis Project](#07)

<a name='01'></a>
## Deep Learning and Natural Language ##

* NLP: automatic manipulation of natural language, like speech and text, by software.
* Deep learning (DL) is a subfield of ML concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks (ANN).
* A nice benefit of DL is the ability to perform automatic feature extraction from raw data (**feature learning**).

10 impressive applications of deep learning
1. **Automatic Colorization of Black and White Images** <br>
    Generally, this approach uses very large convolutional neural networks and <br> supervised layers that recreate the image with the addition of color. <br>
    The paper *Learning Representations for Automatic Colorization* (paper 2 below) <br>
    strives to make colorization cost-effective and less time-consuming. <br>

    **used dataset**: [ImageNet](https://paperswithcode.com/dataset/imagenet) 
    
    *E.g., according to the author of this [hand-colouring](https://www.reddit.com/r/Colorization/) post, the photo below took approximately 1h30 to colorize by hand:*  <br>
    
    <img src="https://raw.githubusercontent.com/adilsonmedronha/Tutorials/main/Deep%20Learning%20for%20Natural%20Language%20Processing/images/1_reddit_post_ww1_German_soldier_colorization.png" width="250" height="250" />
    
    Multiple applications can benefit from automatic colorization:
    * Historical photographs and videos
    * Artist assistance

    Papers: [1](https://github.com/adilsonmedronha/Tutorials/blob/main/Deep%20Learning%20for%20Natural%20Language%20Processing/papers/1_Colorful_Image_Colorization_1603-08511.pdf), [2](https://github.com/adilsonmedronha/Tutorials/blob/main/Deep%20Learning%20for%20Natural%20Language%20Processing/papers/1_Learning_Representations_for_Automatic_Colorization_1603-06668.pdf) <br> <br>

    
2. **Automatically Adding Sounds to Silent Movies** <br>
    A deep learning model associates the video frames with a database of pre-recorded sounds <br> in order to select the sound that best matches what is happening in the scene.
    * [Video demonstration](https://youtu.be/0FW99AQmMc8) 
    * [Paper](https://github.com/adilsonmedronha/Tutorials/blob/main/Deep%20Learning%20for%20Natural%20Language%20Processing/papers/2_Visually_Indicated_Sounds_1512-08512.pdf) <br> <br>


3. **Automatic Handwriting Generation** <br>
    In this task, a model learns from a corpus of text and then generates new text, word-by-word or character-by-character.
    * Papers: [1](https://github.com/adilsonmedronha/Tutorials/blob/main/Deep%20Learning%20for%20Natural%20Language%20Processing/papers/3_Generating_Sequences_With_1308-0850v5.pdf), [2](https://github.com/adilsonmedronha/Tutorials/blob/main/Deep%20Learning%20for%20Natural%20Language%20Processing/papers/3_Generating_Text_with_Recurrent_Neural_Networks_LANG-RNN.pdf) <br> <br>

4. **Automatic Image Caption Generation** <br>
    Automatic image captioning is the task where, given an image, the system must generate a caption that describes its contents. <br>
    Generally, these systems use very large convolutional neural networks for object detection in the photographs and then a recurrent neural network such as an LSTM to turn the labels into a coherent sentence.
    * Papers: [1](https://github.com/adilsonmedronha/Tutorials/blob/main/Deep%20Learning%20for%20Natural%20Language%20Processing/papers/4_Deep_Visual_Semantic_Alignments_for_Generating_Image_Descriptions_cvpr2015.pdf), [2](https://github.com/adilsonmedronha/Tutorials/blob/main/Deep%20Learning%20for%20Natural%20Language%20Processing/papers/4_Explain_Images_with_Multimodal_Recurrent_Neural_Networks_1410-1090v1.pdf) <br> <br>

5. **Sentiment analysis** <br>
    Aspect-specific sentiment analysis using hierarchical deep learning. <br> <br>

6. **Text classification** <br>
    Recurrent Convolutional Neural Networks for Text Classification. <br>
    The task is to assign a document to one or more classes or categories <br>
    * [Paper](https://github.com/adilsonmedronha/Tutorials/blob/main/Deep%20Learning%20for%20Natural%20Language%20Processing/papers/6_Convolutional_Neural_Networks_For_Text_Classification_9745-44425-1-PB.pdf) <br> <br>

7. **Named Entity Recognition** <br>
    Neural architectures for named entity recognition. NER — sometimes referred to as entity chunking, <br> extraction, or identification — is the task of identifying and categorizing key information (entities) in text.
    * [Paper](https://github.com/adilsonmedronha/Tutorials/blob/main/Deep%20Learning%20for%20Natural%20Language%20Processing/papers/7_Neural_Architectures_for_Named_Entity_Recognition_1603-01360.pdf) <br> <br>

8. **Reading Comprehension** <br>
    In the SQuAD dataset (Stanford researchers), the answer to each question is a segment of text from the corresponding reading passage (Fig. 1 in the paper). <br>
    The authors build a strong logistic regression model, which achieves an F1 score of 51.0%. <br>
    However, human performance (86.8%) is much higher.
    * [Paper](https://github.com/adilsonmedronha/Tutorials/blob/main/Deep%20Learning%20for%20Natural%20Language%20Processing/papers/8_SQuAD_100%2C000%2B_Questions_for_Machine_Comprehension_of_Text1606-05250.pdf) <br> <br>

9. **[InferKit's Text Generation](https://app.inferkit.com/demo)** <br>
    It is a tool that takes text you provide and generates what it thinks comes next, <br>
    using a state-of-the-art neural network. It is configurable and can produce text of any length on <br> practically any topic. An example: <br>
        Input: <br>
        While not normally known for his musical talent, Elon Musk is releasing a debut album <br>
        Completion: <br>
        While not normally known for his musical talent, Elon Musk is releasing a debut album. <br>
        **It's called "The Road to Re-Entry," and it features an astounding collection of songs... (continued)** <br> <br>

10. **Pixel restoration CSI style** <br>
    Early in 2017, Google Brain researchers trained a Deep Learning network to take very low resolution images of faces and predict what each face most likely looks like. <br>
    * [Paper](https://github.com/adilsonmedronha/Tutorials/blob/main/Deep%20Learning%20for%20Natural%20Language%20Processing/papers/10_Pixel_Recursive_Super_Resolution_1702-00783.pdf)


**Additional Examples** <br>
Below are some examples in addition to those listed above:

* Automatic speech recognition. <br>
    [Deep Neural Networks for Acoustic Modeling in Speech Recognition](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38131.pdf), 2012 <br>
* Automatic speech understanding. <br>
    [Towards End-to-End Speech Recognition with Recurrent Neural Networks](http://proceedings.mlr.press/v32/graves14.pdf), 2014 <br>
* Automatically focus attention on objects in images. <br>
    [Recurrent Models of Visual Attention](https://arxiv.org/pdf/1406.6247v1.pdf), 2014 <br>
* Automatically answer questions about objects in a photograph. <br>
    [Exploring Models and Data for Image Question Answering](https://arxiv.org/pdf/1505.02074v4.pdf), 2015 <br>
* Automatically turning sketches into photos. <br>
    [Convolutional Sketch Inversion](https://arxiv.org/pdf/1606.03073.pdf), 2016 <br>
* Automatically create stylized images from rough sketches.  <br>
    [Neural Doodle](https://github.com/alexjc/neural-doodle) <br>

**Here are 30 impressive NLP applications using deep learning methods:** <br>
    [A mind map with links](https://www.xmind.net/m/AEYf/). Download it and open it in XMind.

<a name='02'></a>
## Cleaning Text Data ##

**Text is messy**: you cannot go straight from raw text to fitting an ML or DL model.

You must clean your text first, which means splitting it into words and normalizing issues such as:
* Upper and lower case characters.
* Punctuation within and around words.
* Numbers such as amounts and dates.
* Spelling mistakes and regional variations.
* Unicode characters.
* and much more...
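
The normalization issues listed above can be sketched with the standard library alone. This is a minimal illustration, not the course's own code: the `normalize` helper, the `NUM` placeholder, and the sample sentence are assumptions made for the example, and every step is lossy, so real projects should choose the steps that fit their data.

```python
import re
import string
import unicodedata


def normalize(text: str) -> str:
    """Apply a few common, lossy normalizations to a raw string (illustrative helper)."""
    # Decompose accented characters and drop the combining marks ("é" -> "e").
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Fold case so that "The" and "the" become the same token.
    text = text.lower()
    # Replace every run of digits with a placeholder (NUM is an arbitrary choice).
    text = re.sub(r"\d+", "NUM", text)
    # Delete ASCII punctuation within and around words.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


print(normalize("Café prices ROSE by 12%, didn't they?"))
# prints: cafe prices rose by NUM didnt they
```

Note that `string.punctuation` only covers ASCII punctuation, so typographic quotes such as “ and ” would survive this sketch.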

**Manual Tokenization**

* **Tokenization**: turning raw text into a data structure we can model, typically a list of words (tokens).
 

In [14]:
# the text could also be read from a file
data = "For God so loved the world that he gave his one and only Son, that whoever believes in him shall not perish but have eternal life"
# split on whitespace, then lowercase every word
words = data.split()
words = [word.lower() for word in words]
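
Splitting on whitespace leaves punctuation attached to words (for example `son,`). Two common ways to handle this by hand are sketched below; the function names `tokenize_split` and `tokenize_regex` are hypothetical, chosen only for this illustration.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize_split(text: str) -> list:
    """Split on whitespace, then strip punctuation from each token."""
    stripped = (word.translate(_PUNCT_TABLE).lower() for word in text.split())
    # Tokens that were pure punctuation become empty strings; drop them.
    return [word for word in stripped if word]


def tokenize_regex(text: str) -> list:
    """Keep only runs of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text.lower())


sample = "He gave his one and only Son, that whoever believes in him"
print(tokenize_split(sample))
print(tokenize_regex(sample))
```

The two approaches agree on the sample sentence but differ on hyphenated words: `tokenize_split("re-use")` gives `['reuse']`, while `tokenize_regex("re-use")` gives `['re', 'use']`.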

**NLTK Tokenization** <br>
Many of the best practices for tokenizing raw text have been captured
and made available in a Python library called the Natural Language Toolkit, or NLTK for short.

In [33]:
# import nltk 
# nltk.download()
# or via a command line:
!python -m nltk.downloader all

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\admed\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\admed\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\admed\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\admed\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     C:\Users\admed\AppData\Roaming\nltk_data...
[nltk_data]    | 

In [16]:
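# a forward slash avoids the invalid "\B" escape and works on every OS
filename = "datasets/Bible_KJ.txt"
# the with-block closes the file even if reading fails
with open(filename, "rt", encoding="utf-8") as file:
    text = file.read()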

**Pre-processing**
* A **stop word** is a commonly used word (such as “the”, “a”, “an”, “in”, ...). <br>
    We do not want these words to take up space in our database.
* **Stemming** is the process of reducing inflected (or sometimes derived) words to their word stem. <br>
    *E.g., the words [consultant, consulting, consultative] are all reduced to **consult**.*
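
To make the two ideas concrete without any downloads, here is a toy sketch in plain Python. It is not how NLTK works: `STOP_WORDS` is a tiny hand-written list, and `crude_stem` is a naive suffix stripper standing in for a real algorithm such as the Porter stemmer. All names and the sample sentence are assumptions made for the example.

```python
# Tiny illustrative stop list; NLTK's English list is much longer.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "is", "for"}
# Suffixes to strip, longest first; a crude stand-in for a real stemmer.
SUFFIXES = ("ative", "ing", "ant", "s")


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop list."""
    return [token for token in tokens if token not in STOP_WORDS]


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


tokens = "the consultant is consulting in a consultative role".split()
content = remove_stop_words(tokens)
print(content)                           # ['consultant', 'consulting', 'consultative', 'role']
print([crude_stem(t) for t in content])  # ['consult', 'consult', 'consult', 'role']
```

A real stemmer applies many more rules and conditions, so its output on arbitrary text will differ from this sketch.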

In [32]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# split into words
original = tokens = word_tokenize(text)

# convert to lower case
tokens = [word.lower() for word in tokens]

# remove non-alphabetic tokens
tokens = [word for word in tokens if word.isalpha()]

# filter out stop words
#   Stop words are abundant in any human language.
#   By removing them, we drop low-information words
#   so that more focus goes to the important information
#   (named stop_words so the nltk stopwords module is not shadowed)
stop_words = set(stopwords.words("english"))
tokens = [word for word in tokens if word not in stop_words]

# stemming of words
porter = PorterStemmer()
tokens = [porter.stem(word) for word in tokens]

cleaned = tokens

def to_string(sel="before", data=original):
    # print the token count, the number of stop words left, and the first 50 tokens
    n_stop = len([word for word in data if word in stop_words])
    print("\n {} pre-processing:\n Size = {} | #Stop Words = {} | Demo = \n{} ".format(sel, len(data), n_stop, data[:50]))

to_string()
to_string(sel="after", data=cleaned)


 before pre-processing:
 Size = 952469 | #Stop Words = 379601 | Demo = 
['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'King', 'James', 'Bible', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it'] 

 after pre-processing:
 Size = 376479 | #Stop Words = 176 | Demo = 
['project', 'gutenberg', 'ebook', 'king', 'jame', 'bibl', 'ebook', 'use', 'anyon', 'anywher', 'unit', 'state', 'part', 'world', 'cost', 'almost', 'restrict', 'whatsoev', 'may', 'copi', 'give', 'away', 'term', 'project', 'gutenberg', 'licens', 'includ', 'ebook', 'onlin', 'locat', 'unit', 'state', 'check', 'law', 'countri', 'locat', 'use', 'ebook', 'titl', 'king', 'jame', 'bibl', 'releas', 'date', 'august', 'ebook', 'recent', 'updat', 'decemb', 'langua