----------
# Understanding And Implementing <font color=darkcyan>CNNs</font> For <font color=darkpink>NLP</font> With <font color=orange>Tensorflow</font>
***
***
# Let's Start by Understanding HOW <font color=darkcyan>CNNs</font> Can Be Used For <font color=darkpink>NLP</font>
***

### <font color=darkcyan>CNN -- Convolutional Neural Network</font><br><br> <font color=darkpink>NLP -- Natural Language Processing</font>

***

When we hear about Convolutional Neural Networks (CNNs), we typically think of Computer Vision
- CNNs were responsible for major breakthroughs in Image Classification
- CNNs are the core of most Computer Vision systems today
    - Facebook’s automated photo tagging
    - Self-driving cars
    - Advanced counting algorithms
    - SLAM and other autonomous movement based off localization
    - Many more...

More recently we’ve also started to apply CNNs to problems in Natural Language Processing and we've gotten some interesting results

***
**In this how-to:**
1. I’ll try to summarize what CNNs are
2. I’ll show how CNNs are used in NLP
3. I’ll walk through the intuitions behind CNNs
    - These are somewhat easier to understand for the Computer Vision use case, so I’ll start there, and then slowly move towards NLP
4. I’ll implement a CNN for Text Classification using only Tensorflow (no Keras)

***

***
## Back to Basics (1) : What is a <font color=darkcyan>Convolution</font> ??
***

Convolution is a mathematical operation defined as a function derived from two given functions by integration that expresses how the shape of one is modified by the other. Now that's confusing as hell, but the idea is that a convolution links two functions together. If we consider one function to be the original image, and the second function to be some smaller image (a kernel), then a convolution operation simply describes a relationship between these two images.
***
**Let's look a bit more simplistically at the convolution operation and how it is used for computer vision:**
***

***
The **“convolution” operation** for computer vision is, at its core, a tool used to extract features from an input image
- A **convolution** preserves the spatial relationship between pixels by learning image features using small squares of input data
    - We will not go into the mathematical details of the **convolution operation** here, but will try to understand how it works intuitively
    - NOTE: The word Image here means any matrix with at least 3 dimensions -- **[Width, Height, Channels]**

***
Every image can be considered as a matrix of pixel values having at least 3 dimensions **[Width, Height, Channels]**
- Consider a 5 x 5 image whose pixel values are only 0 and 1
    - NOTE: For a grayscale image, pixel values range from 0 to 255
    - NOTE: The green matrix below is a grayscale image whose values have been squished to be either 0 or 1 (*i.e. binarization via thresholding*)
***

***
![cnn_image_for_gif](inline_images/cnn_image_for_gif.png)

                     Fig. 1. This is the 5 x 5 image with pixel values being either 0 (black) or 1 (white)
***

***
![cnn_kernel_for_gif](inline_images/cnn_kernel_for_gif.png)

                Fig. 2. This is the 3 x 3 image (kernel) with 'pixel' values being either 0 (black) or 1 (white)
***

***
![cnn_gif](inline_images/cnn_gif.gif)

             Fig. 3. This is a gif showing the convolution operation (kernel over image) producing an output feature
                             The output feature is often called a 'convolved feature' or a 'feature map'
***

***
**Take a moment to understand how the computation above is being done**
***
1. We slide the orange matrix over our original image (green) 1 pixel at a time (stride=1)
2. For every position, we compute the element-wise multiplication between the two matrices
3. We add the multiplication outputs to get the final integer which forms a single element of the output matrix (pink)
    
                --> NOTE: The 3×3 matrix only “sees” a PART (window) of the input image in each stride
***
**Let's See An Example Calculation To Illustrate How The Convolution is Done (Dot Product)**
***

***
**Example: Convolution For Bottom Right Hand Feature Value *( 3rd Row to 5th Row, 3rd Col to 5th Col, Indicated With Asterisk )***
***
       IMAGE             KERNEL                             MULTIPLICATION               ADDITION           1 CELL OF OUTPUT    
                                                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    |1|1|1|0|0|                                     |                                                 |
    |0|1|1|1|0|          [1|0|1]                    |  [1*1|1*0|1*1]   [1|0|1]       [1]+[0]+[1] +    |
    |0|0[1|1|1]*  <--->  [0|1|0]  --CONVOLUTION-->  |  [1*0|1*1|0*0] = [0|1|0] ----> [0]+[1]+[0] +    |      =     [4]
    |0|0[1|1|0]*         [1|0|1]                    |  [1*1|0*0|0*1]   [1|0|0]       [1]+[0]+[0] ...  |
    |0|1[1|0|0]*                                    |_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _|
         * * *
***
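
The sliding-window computation described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation used later in this how-to: the helper name `convolve2d_valid` is made up for this example, and the image and kernel values are copied from the worked calculation above. Like most deep learning libraries, the sketch does not flip the kernel (strictly speaking it is a cross-correlation), which makes no difference here because the kernel is symmetric.

```python
import numpy as np

# The 5 x 5 binary image and the 3 x 3 kernel from the worked example above
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])


def convolve2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and sum the element-wise products.

    Hypothetical helper written for this illustration only.
    """
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    feature_map = np.zeros((out_h, out_w), dtype=image.dtype)
    for row in range(out_h):
        for col in range(out_w):
            r, c = row * stride, col * stride
            window = image[r:r + k_h, c:c + k_w]          # the part of the image the kernel "sees"
            feature_map[row, col] = np.sum(window * kernel)  # multiply element-wise, then add
    return feature_map


feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

The bottom right value of the printed feature map is the `4` computed by hand above.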

***
**Convolution Terminology**
***
- The 3×3 matrix above is called:
    - ***kernel*** or...
    - ***filter*** or...
    - ***feature detector***
<br>
<br>
- The matrix formed by sliding the kernel over the image and computing the *dot product* is called:
    - ***Convolved Feature*** or...
    - ***Activation Map*** or
    - ***Feature Map***
***

***
It is evident from the animation above that different values of the *kernel* matrix will produce different *Feature Maps* for the same input image
***
**As an example, consider the following input image:**
***
![conv_kernel_original](inline_images/conv_kernel_original.png)

                     Fig. 4. This is the original image that the different kernels will be applied to
***
**Here are various kernels, as shown on Wikipedia ([link](https://en.wikipedia.org/wiki/Kernel_(image_processing))), and the various effects they have on the image**
***
![conv_kernel_wiki](inline_images/conv_kernel_wiki.PNG)

         Fig. 5. These are several kernels and their respective outputs w.r.t. the original image (source: Wikipedia)

***

***
**Let's Look At a Gif Showing Two Convolution Operations On The Same Image Using Different Filters**
***
![kernel_conv_example_gif](inline_images/kernel_conv_example.gif)

     Fig. 6. This is a gif showing two convolution operations on the same image using different kernels (source: cs.nyu.edu)

***
Description of Above GIF
- A filter (with red outline) slides over the input image (convolution operation) to produce a feature map. 
- The convolution of another filter (with the green outline), over the same image gives a different feature map as shown
- It is important to note that the Convolution operation captures the local dependencies in the original image
- Also notice how these two different filters generate different feature maps from the same original image
- Remember that the image and the two filters above are just numeric matrices as we have discussed above
***
Filter/Kernel Values and Size
- In practice, a CNN learns the values of these filters on its own during the training process
    - Although we still need to specify parameters such as number of filters, filter size, architecture of the network etc. before the training process
- The more filters we have, the more image features get extracted, and generally the better our network becomes at recognizing patterns in images
***

***
**The size of the Feature Map (Convolved Feature) is controlled by *three parameters* that we need to decide before the convolution step is performed:**
<br>
***
**1. Depth:** 
<br>
***
- **Depth** corresponds to the number of filters we use for the convolution operation
- In the network shown in Figure 7, we are performing convolution of the original boat image using three distinct filters
    - This produces three different feature maps as shown
    - You can think of these three feature maps as stacked 2d matrices, so, the **depth** of the feature map would be three.
***        
![depth_image](inline_images/depth.png)

             Fig. 7. This is an image showing a convolution operation performed with 3 distinct filters

***
**2. Stride:** 
<br>
***
- **Stride** is the number of pixels by which we shift our filter matrix over the input matrix each step
    - When the **stride** is 1 then we move the filters one pixel at a time
    - When the **stride** is 2, then the filters jump two pixels at a time as we slide them around
        - Having a larger **stride** will produce smaller feature maps (unless we *pad*)
<br>
<br>
***
**3. Zero-Padding:** 
<br>
***
- Sometimes, it is convenient to **pad** the input matrix with zeros around the border (sometimes simply called ***padding***)
    - **Zero-padding** is done so that we can apply the filter to bordering elements of our input image matrix
    - A nice feature of **zero-padding** is that it allows us to control the size of the feature maps
        - Adding **zero-padding** is also called *wide convolution*, and not using **zero-padding** would be a *narrow convolution*
        - We will discuss wide and narrow convolutions in detail later


***
## Back to Basics (2) : What's a <font color=darkcyan>Convolutional Neural Network</font> ??
***

Now you know what convolutions are... But what about CNNs? 
- CNNs are basically just several layers of convolutions with ***nonlinear activation functions*** like ***ReLU*** or ***tanh*** applied to the results
    - This pretty much means that we check a part of the image using the kernel and if it ***sees*** something similar to itself then it will ***activate***
    - i.e. If a kernel to detect circles was on a part of an image with a circle, it will have a significant output from a nonlinear activation function

- In a traditional feedforward neural network we connect each input neuron to each output neuron in the next layer
    - Known as a ***fully connected layer (FC layer)***, or ***affine layer***
    - <font color=red>**In CNNs we don’t do that**</font>


                     
                    Instead, we use convolutions over the input layer to compute the output
                     


- This results in local connections, where each region of the input is connected to a neuron in the output
- Each layer applies different filters, typically hundreds or thousands like the ones shown above, and combines their results
- There’s also something called pooling (subsampling) layers, but I’ll get into that later
- During the training phase, a CNN automatically learns the values of its filters based on the task you want to perform (as mentioned earlier)
    - For example in Image Classification: 
        - A CNN may learn to detect edges from raw pixels in the first layer
        - The CNN may then use the edges to detect simple shapes in the second layer
        - It would then use these shapes to detect higher-level features
        - This process continues with increasing levels of abstraction (faces, groupings, assemblies, etc.)
        - **The last layer is then a classifier that uses these high-level features**
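
The "activation" idea from earlier in this section (a kernel responds strongly when the patch underneath looks like the kernel itself) can be sketched as follows. This is a toy illustration under stated assumptions: the `bias` value and the helper name `neuron_output` are invented for the example, and a real CNN would learn both the kernel and the bias during training.

```python
import numpy as np


def relu(x):
    """ReLU nonlinearity: negative responses are clipped to zero."""
    return np.maximum(x, 0)


kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

matching_patch = kernel.copy()   # a patch that looks exactly like the kernel
different_patch = 1 - kernel     # a patch with the opposite pattern
bias = -2.0                      # hypothetical bias, chosen only for illustration


def neuron_output(patch, kernel, bias):
    """Convolution response for one window, followed by the nonlinearity."""
    return relu(np.sum(patch * kernel) + bias)


strong = neuron_output(matching_patch, kernel, bias)   # 5 - 2 = 3 -> the filter "activates"
weak = neuron_output(different_patch, kernel, bias)    # 0 - 2 = -2 -> clipped to 0
print(strong, weak)
```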

***
**Let's look at an example of a full Convolutional Neural Network For Image Classification:**
***

***
![CNN](inline_images/CNN.png)

        Fig. 8. This is an image showing a convolutional neural network for image classification (Softmax for 4 Outputs)

***

***
**CNN Computation Terminology**

***
Here are two common ideas that will help us understand CNNs and are worth paying attention to: **Location Invariance** and **Local Compositionality**
***
**Location Invariance**
- Let’s say you want to classify whether or not there’s an elephant in an image
    - **Location invariance** means being able to find an elephant regardless of where it occurs in the image
        - Only being able to detect elephants in the bottom left hand corner is not particularly useful... hence the importance of **location invariance**
        - In practice *pooling* will also give some invariance to small translations, rotations and scalings... but more on that later...
<br>
<br>

**Local Compositionality (IN MY OPINION THIS IS THE KEY TO THE WHOLE THING)**
- Each filter composes a local patch of lower-level features into higher-level representation
    - This is the main strength of CNNs
    - This key aspect of CNNs is highly intuitive, as it makes sense that you build edges from pixels, shapes from edges, objects from shapes... and so on
***

***
## Back to Basics (3) : What is <font color=darkpink>Natural Language Processing</font> ??
***

Now you know what convolutions and convolutional neural networks are... but what about NLP? 
- Natural Language Processing, or NLP, is the sub-field of AI that is focused on enabling computers to understand and process human languages
                
              SIDENOTE: Human language is an example of UNSTRUCTURED data for Machine Learning

***
**Let's look at an example animation showing some unstructured data extraction that utilizes an NLP backend**
***

***
![London_NLP](inline_images/london_nlp.gif)

             Fig. 9. Unstructured Data Extraction For Facts About London (Source: NLP_IS_FUN.. link below)
***

***
**For more information on the basics of NLP I will link here to a tremendously well done article explaining the basics all the way through pipeline development... consider it a prerequisite for continuing the tutorial**
***

**As of November 26th, 2018, I have created/started-to-create a document to help better understand traditional NLP... it is very similar to NLP_IS_FUN, but I tried to tailor it to my interests and locations and expand on it**


***
__[LINK_TO_NLP_IS_FUN](https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e)__
***

***
## How Do We Apply <font color=darkcyan>Convolutional Neural Networks</font> to <font color=darkpink>Natural Language Processing</font> ??
***

Instead of image pixels, the input to most NLP tasks is a sentence or document represented as a matrix
- Each row of the matrix corresponds to one token, typically a word, but it could be a character
    - That is, each row is a vector that represents a word
        - Typically, these vectors are word embeddings (low-dimensional representations) like word2vec or GloVe
        - They could also be one-hot vectors that index the word into a vocabulary
        - These *representations* are usually made by someone else and we simply steal them as they are open-sourced
    - For a 10 word sentence using a 100-dimensional embedding we would have a 10×100 matrix as our input... Our ***'image'***

In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words)
- The “width” of our filters is usually the same as the width of the input matrix
- The height, or region size, may vary, but sliding windows over 2-5 words at a time is typical
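
To make the shapes concrete, here is a minimal NumPy sketch of a single filter sliding over a sentence matrix. The random numbers stand in for real word embeddings (an assumption made purely for illustration), and the 10 x 100 shape matches the 10 word, 100-dimensional example above.

```python
import numpy as np

rng = np.random.default_rng(0)

sentence_length, embedding_dim = 10, 100
# Stand-in for a real sentence matrix: one row per word, one column per embedding dimension
sentence_matrix = rng.normal(size=(sentence_length, embedding_dim))

region_size = 3  # the filter looks at 3 words at a time
# The filter is as wide as the input matrix, so it can only slide downwards, word by word
text_filter = rng.normal(size=(region_size, embedding_dim))

feature_map = np.array([
    np.sum(sentence_matrix[i:i + region_size] * text_filter)
    for i in range(sentence_length - region_size + 1)
])

print(feature_map.shape)  # (8,) -> one value per 3-word window
```

Because the filter covers full rows, the resulting feature map is one-dimensional: one number for each window of words.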

***
**Putting all the above together, a <font color=darkcyan>Convolutional Neural Network</font> for <font color=darkpink>Natural Language Processing</font> may look like this:**
<br>
<br>
*Take a few minutes and try to understand this picture and how the dimensions are computed*
<br>
*Ignore the pooling for now, we’ll explain that later*
***

***
![CNN_FOR_NLP](inline_images/CNN_for_NLP_1.png)

            Fig. 10. Image Showing The Structure of a CNN That Is Used for NLP and the Flow of Information

***

**Detailed Description Below:**
***
- Above is an illustration of a Convolutional Neural Network (CNN) architecture for sentence classification
- Here we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters
    - Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps
- Then 1-max pooling is performed over each map
    - i.e. The largest number from each feature map is recorded
- Thus a univariate feature vector is generated from all six maps
- These 6 features are then concatenated to form a feature vector for the penultimate layer
- The final softmax layer then receives this feature vector as input and uses it to classify the sentence
    - We assume binary classification and hence depict two possible output states
    
**Source:** ***Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.***
***
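
The flow described above (convolve, 1-max pool, concatenate, softmax) can be sketched as a single forward pass in NumPy. This is a rough sketch under assumptions, not the model from the paper: all weights are random rather than trained, the sentence and embedding sizes are small values picked for illustration, and the activation function applied to each feature map is left out to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): a 7-token sentence with 5-dimensional embeddings
sentence_length, embedding_dim = 7, 5
region_sizes = (2, 3, 4)   # three filter region sizes...
filters_per_region = 2     # ...each with 2 filters
num_classes = 2            # binary classification

sentence_matrix = rng.normal(size=(sentence_length, embedding_dim))


def feature_map_for(sentence_matrix, text_filter):
    """Slide a full-width filter down the sentence matrix (no padding)."""
    region_size = text_filter.shape[0]
    steps = sentence_matrix.shape[0] - region_size + 1
    return np.array([
        np.sum(sentence_matrix[i:i + region_size] * text_filter)
        for i in range(steps)
    ])


pooled = []
map_lengths = []
for region_size in region_sizes:
    for _ in range(filters_per_region):
        text_filter = rng.normal(size=(region_size, embedding_dim))  # untrained, random filter
        fmap = feature_map_for(sentence_matrix, text_filter)
        map_lengths.append(len(fmap))   # feature maps have different lengths
        pooled.append(fmap.max())       # 1-max pooling keeps only the largest value

feature_vector = np.array(pooled)       # 6 pooled features, concatenated

# Final softmax layer (random weights, for shape illustration only)
weights = rng.normal(size=(feature_vector.size, num_classes))
logits = feature_vector @ weights
probabilities = np.exp(logits - logits.max())
probabilities /= probabilities.sum()

print(map_lengths)           # [6, 6, 5, 5, 4, 4]
print(feature_vector.shape)  # (6,)
print(probabilities.shape)   # (2,)
```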

***
![CNN_FOR_NLP](inline_images/2range.gif)

            Fig. 11. GIF showing how the convolution operation (With a Range of 2) Would Move Over the Sentence

***

***
![CNN_FOR_NLP](inline_images/3range.gif)

            Fig. 12. GIF showing how the convolution operation (With a Range of 3) Would Move Over the Sentence

***

***
![CNN_FOR_NLP](inline_images/4range.gif)

            Fig. 13. GIF showing how the convolution operation (With a Range of 4) Would Move Over the Sentence

***

***
## <font color=red>Uh-Oh </font> : Are <font color=darkcyan>Convolutional Neural Networks</font> a Bad Fit for <font color=darkpink>Natural Language Processing</font> ??
***

We discussed earlier that CNNs make intuitive sense for image processing tasks, as they are built to handle the two main requirements of the task: location invariance and local compositionality.
- While location invariance might seem to make sense at first (where a word appears in a sentence could seem important...), we soon start to realize that it may not be as important as other things.
    - Words that are close together in a sentence are often related, but not always
    - Sometimes words from much earlier in a sentence, or from a previous sentence, may have a large impact
- Local compositionality, similarly, may seem like an adequate tool for use in NLP, as certain words may build higher conceptual ideas by compounding other words (adjectives modifying nouns, adverbs, and other modifiers, etc.)
    - However, how this is actually used to make sense of language is far from obvious
- Unlike with image processing, NLP does not seem to be an intuitive fit for CNNs

***
              SIDENOTE: RNNs (Recurrent Neural Networks) are a MUCH more intuitive fit for modelling NLP tasks
                        as they are sequential models that are more similar in structure and function to how humans
                        think and speak
***
All this being said, a model does not have to be perfect, or even that good, to be useful. 
- A common approach in modern times has simply been counting the number of occurrences of words in a sentence/paragraph (a bag-of-words model) and drawing conclusions
    - This is inherently flawed, yet it still has great utility in many cases

The main benefit that makes CNNs a worthwhile departure for certain NLP tasks is the **speed and efficiency gains that they yield when compared with more intuitive operations and models (RNNs)**
- This may seem like a small benefit, but when you consider an [N-Gram](https://en.wikipedia.org/wiki/N-gram) model ([K-Mer](https://en.wikipedia.org/wiki/K-mer) in genomics), which is what Google reportedly uses to predict what words you may wish to search, it becomes obvious that it's a big deal
    - Google is reportedly limited to N-Grams no larger than 5 due to computational limits
    - CNNs can handle much larger N-Grams with relatively little computational stress
    - If Google struggles computationally with something, it is pretty obvious that the task is computationally very intensive
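
For readers who have not met N-Grams before, the idea can be sketched in a couple of lines of plain Python. The helper name `ngrams` and the example sentence are made up for this illustration. A convolutional filter with a region size of N looks at exactly one such N-Gram at each step.

```python
def ngrams(tokens, n):
    """Return every run of `n` consecutive tokens (hypothetical helper for illustration)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the movie was surprisingly good".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
# [('the', 'movie'), ('movie', 'was'), ('was', 'surprisingly'), ('surprisingly', 'good')]
print(trigrams)
# [('the', 'movie', 'was'), ('movie', 'was', 'surprisingly'), ('was', 'surprisingly', 'good')]
```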
    
***

***
## <font color=purple>Hyperparameter Exploration </font> for <font color=darkcyan>Convolutional Neural Networks</font>
***

Before we dive into the model, it makes sense to explore some of the hyperparameters that we will be navigating through, what they are, how we can change them to best suit our model, and how they relate differently to NLP than Image Processing.

***

***
## <font color=purple>Hyperparameter Exploration (1)</font> --- Narrow V. Wide Convolutions
***

A **narrow convolution**, also known as a non-padded convolution, is the simplest type of convolution.
- It involves applying a kernel over all of the image's pixels without adding or taking anything away from the original image
- The output is a matrix that is smaller by a predictable amount
![Narrow Convolution](inline_images/no_padding_conv.gif)

                    Fig. 14. GIF showing how a narrow convolution operation works on a 5x5 image

***
A **wide convolution**, also known as a padded (or zero-padded) convolution, is a convolution that involves padding the edges of the original image with zeros
- This is so that the kernel can access all pixels (from the original image) with the center of the kernel
- With a stride of 1 and a suitable amount of padding, this has the added benefit that the output matrix is the same shape as the original image
![Wide Convolution](inline_images/zero_padded_conv.gif)

                    Fig. 15. GIF showing how a wide convolution operation works on a 5x5 image
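
The difference between the two GIFs above can be reproduced with `np.pad`. This is a small sketch, assuming a 5 x 5 input, a 3 x 3 kernel and a stride of 1; the helper `convolve2d` is written for this example only and its values are arbitrary.

```python
import numpy as np


def convolve2d(image, kernel, padding=0):
    """Stride-1 convolution with optional zero-padding (illustrative helper)."""
    # Surround the image with `padding` rows/columns of zeros on every side
    padded = np.pad(image, padding, mode="constant", constant_values=0)
    k_h, k_w = kernel.shape
    out_h = padded.shape[0] - k_h + 1
    out_w = padded.shape[1] - k_w + 1
    output = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            output[r, c] = np.sum(padded[r:r + k_h, c:c + k_w] * kernel)
    return output


image = np.arange(25).reshape(5, 5)   # arbitrary 5 x 5 "image"
kernel = np.ones((3, 3))              # arbitrary 3 x 3 kernel

narrow = convolve2d(image, kernel, padding=0)  # no padding
wide = convolve2d(image, kernel, padding=1)    # one ring of zeros around the image

print(narrow.shape)  # (3, 3) -> smaller than the input
print(wide.shape)    # (5, 5) -> same size as the input
```

The interior of the padded result is identical to the unpadded result; the padding only adds the extra border values where the kernel is centred on an edge pixel.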


***
**The equations to calculate the output matrix's width and height are as follows :**

***
     
     Let P = Padding
     Let OMW = Output Matrix Width        Let W = Width        Let Fw = Filter Width        Let Sw = Stride Width
     Let OMH = Output Matrix Height       Let H = Height       Let Fh = Filter Height       Let Sh = Stride Height             
***
\begin{equation}
OMW = \frac{W - F_w + 2P}{S_w} + 1
\end{equation}
***
\begin{equation}
OMH = \frac{H - F_h + 2P}{S_h} + 1
\end{equation}

***

**Using the Padded Gif Above As An Example:**


- **P = 1**
- **W = 5**
- **H = 5**
- **Fw = 3**
- **Fh = 3**
- **Sw = 1**
- **Sh = 1**
***
**OUTPUT MATRIX WIDTH**
\begin{equation}
\begin{aligned}
OMW &= \frac{W - F_w + 2P}{S_w} + 1 \\
    &= \frac{5 - 3 + 2(1)}{1} + 1 \\
    &= 5
\end{aligned}
\end{equation}
***
**OUTPUT MATRIX HEIGHT**
\begin{equation}
\begin{aligned}
OMH &= \frac{H - F_h + 2P}{S_h} + 1 \\
    &= \frac{5 - 3 + 2(1)}{1} + 1 \\
    &= 5
\end{aligned}
\end{equation}
***
**Therefore the output dimensions are (5,5)... the same as the input image**
- This is a traditional example of zero-padding, which I personally refer to as *'same padding'* or a *'same convolution'*
    - Because the output size is the *'same'* as the input size
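
The formula above is easy to wrap in a small helper for checking shapes by hand. This is a sketch only: the function name `conv_output_size` is made up here, and rounding down when the stride does not divide evenly is an assumption on my part (it is a common convention, but libraries differ).

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output width (or height) of a convolution along one dimension.

    Implements (W - F + 2P) / S + 1 from the equations above.
    Floor division is an assumption for strides that do not divide evenly.
    """
    return (input_size - filter_size + 2 * padding) // stride + 1


# The padded GIF example: W = 5, F = 3, P = 1, S = 1
print(conv_output_size(5, 3, padding=1, stride=1))  # 5 -> same size as the input
# The same image without padding
print(conv_output_size(5, 3, padding=0, stride=1))  # 3
```

The same helper is used once for the width and once for the height, since the two equations have the same form.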

***
## <font color=purple>Hyperparameter Exploration (2)</font> --- Stride Size
***

Another hyperparameter for your convolutions is the **stride size**
- This defines how much you want to shift your filter at each step
- In all the examples above the stride size was 1
    - As such, consecutive applications of the filter would overlap
- A larger stride size leads to fewer applications of the filter and a smaller output size

***
**The following from the Stanford CS231n class [website](http://cs231n.github.io/convolutional-networks/)
 shows stride sizes of 1 and 2 applied to a one-dimensional input:**
![Stride Sizing](inline_images/stride.png)

    Fig. 16.         Convolution Stride Size 1                                   Convolution Stride Size 2
***

**Typically, in image processing, convolutional neural networks use a stride size of 1 so as not to miss any information (unless downsampling is desired)**
- However, for NLP, by using a larger stride we can develop a network that acts more similarly to a Recursive Neural Network (not to be confused with the Recurrent kind) and takes the shape of a tree
- We can observe from the image above that by increasing the stride of our convolution, we decrease the size of the output matrix (the filter is applied at fewer, more widely spaced positions)
    - This may prove useful in NLP applications and we will dive deeper later
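
The effect of the stride is easiest to see on a one-dimensional input. Here is a minimal sketch; the helper name `convolve1d` is invented for this example and the input values are simply illustrative.

```python
import numpy as np


def convolve1d(signal, kernel, stride=1):
    """Apply `kernel` to `signal`, moving `stride` positions between applications."""
    steps = (len(signal) - len(kernel)) // stride + 1
    return np.array([
        np.dot(signal[i * stride:i * stride + len(kernel)], kernel)
        for i in range(steps)
    ])


signal = np.array([0, 1, 2, -1, 1, -3, 0])
kernel = np.array([1, 0, -1])

stride_1 = convolve1d(signal, kernel, stride=1)
stride_2 = convolve1d(signal, kernel, stride=2)

print(stride_1)  # [-2  2  1  2  1] -> 5 overlapping applications
print(stride_2)  # [-2  1  1]       -> 3 applications, every other position skipped
```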

***
## <font color=purple>Hyperparameter Exploration (3)</font> --- Pooling Layers
***

A key aspect of Convolutional Neural Networks is their **pooling layers**
- Pooling layers are typically applied after the convolutional layers
- Pooling layers subsample their input (reduce the size by taking a representative sample)
    - Note that what the subsample represents changes depending on the type of pooling
- The most common way to do pooling is to apply a max operation to a region (or the entire convolution output)
    - This is called max pooling
- Another common operation is conducted by averaging the values in a region (or the entire convolution output)
    - This is called average pooling
***
          NOTE: Traditionally, in image processing we pool by passing a window (say 2x2) over the entire 
                convolutional output, however, in NLP we typically apply pooling over the complete output,
                yielding just a single number for each convolutional filter (as depicted in Fig 10)
***
**The following is a gif showing how traditional max pooling is conducted:**<br>
![Stride Sizing](inline_images/maxpool_animation.gif)

                      Fig. 17.    Max Pooling With a 2x2 Kernel Over a 4x4 Matrix Producing a 2x2 Matrix
***

**You might be curious as to why NLP follows a different procedure by processing the entire convolutional output into 1 number...**
- The best answer is probably that, regardless of the shape of the convolutional output, only one number is generated
    - This means that sentences, words, or paragraphs of different sizes can be compared, as this step normalizes them to a fixed-size output
- Inherently we are, again, losing information about context by pooling
    - If a filter is looking for the word "great" in a sentence, pooling may still communicate that the word "great" was found... however, it may lose the piece of information that indicates **where** in the sentence the word "great" occurs
        - *This is acceptable for the reasons outlined in the 'Uh-Oh' section*
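
Both styles of pooling discussed in this section can be sketched with NumPy. The numbers below are invented for illustration, and `max_pool_2x2` is a hypothetical helper that assumes the input height and width are even.

```python
import numpy as np


def max_pool_2x2(matrix):
    """Traditional 2 x 2 max pooling with a stride of 2 (assumes even height and width)."""
    h, w = matrix.shape
    # Split the matrix into 2 x 2 blocks, then keep the largest value in each block
    return matrix.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


conv_output = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
])
pooled = max_pool_2x2(conv_output)
print(pooled)
# [[6 5]
#  [7 9]]

# NLP-style pooling: the max is taken over the ENTIRE feature map,
# so feature maps of different lengths both collapse to a single number
short_feature_map = np.array([0.2, 1.5, -0.3])
long_feature_map = np.array([0.1, -0.7, 0.9, 2.4, 0.0, 1.1])

print(short_feature_map.max())  # 1.5
print(long_feature_map.max())   # 2.4
```

Note that the position of the largest value is discarded in both cases, which is the loss of "where" information described above.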
***

***
## <font color=purple>Hyperparameter Exploration (4)</font> --- Channels
***

The last concept we need to understand is **channels**
- Channels are different “views” of your input data
    - In image recognition you typically have RGB (red, green, blue) channels
        - Image channels may also be in other formats, larger, or smaller (Grayscale, LAB, HSV, RGBA, etc.) 
- You can apply convolutions across channels, either with different or equal weights
***
**In NLP you could imagine having various channels as well:**
- You could have separate channels for different word embeddings (word2vec and GloVe for example)
- You could have a channel for the same sentence represented in different languages, or phrased in different ways
***
***It boils down to a different representation of the same ground truth... an image of a cat is an image of a cat and whether we view it in RGBA, Grayscale, or HSV, it does not change the fact that there is a cat present in the image***
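
As a rough sketch of what multiple channels could look like for text, the example below stacks two embedding matrices for the same sentence along a third axis. The random arrays are stand-ins for real word2vec and GloVe vectors (an assumption for illustration only), and giving each channel its own filter weights is just one of the options mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

sentence_length, embedding_dim = 10, 100
embedding_a = rng.normal(size=(sentence_length, embedding_dim))  # stand-in for e.g. word2vec
embedding_b = rng.normal(size=(sentence_length, embedding_dim))  # stand-in for e.g. GloVe

# Two "views" of the same sentence, stacked like the colour channels of an image
multi_channel_input = np.stack([embedding_a, embedding_b], axis=-1)
print(multi_channel_input.shape)  # (10, 100, 2)

region_size = 3
# One filter with separate weights for each channel
text_filter = rng.normal(size=(region_size, embedding_dim, 2))

feature_map = np.array([
    np.sum(multi_channel_input[i:i + region_size] * text_filter)
    for i in range(sentence_length - region_size + 1)
])
print(feature_map.shape)  # (8,) -> the channels are summed into a single feature map
```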

***

***
## What <font color=darkpink>Natural Language Processing</font> Tasks Are <font color=darkcyan>Convolutional Neural Networks</font> Suited For ??
***
**To figure out what application we want to work through we will need to look at some of the applications of CNNs to Natural Language Processing that currently exist**
***
**1.  Text Classification**
- i.e. Sentiment Analysis, Spam Detection or Topic Categorization
    - A simple architecture would be as follows
        - Input layer is a sentence composed of concatenated word2vec word embeddings
        - Followed by a convolutional layer with multiple filters
        - Followed by a max-pooling layer
        - Followed finally by a softmax classifier
        
![Text Classification Network Architecture](inline_images/network_arc.png)

        Fig. 18.    Network Architecture for Text Classification From Kim, Y. (2014). CNNs for Sentence Classification
***
**2.  Relation Extraction and Classification**
- Identifying "important" facts, ideas, topics, etc. in a corpus of text
    - This can be useful for identifying similar news articles
    - This could be useful for pulling out facts from a body of text
    - Other relational applications are possible as well
***
**3.  Semantic Information Within Sentences [-Microsoft-]**
- How to learn semantically meaningful representations of sentences that can be used for Information Retrieval
    - The example given in Microsoft's papers includes recommending potentially interesting documents to users based on what they are currently reading
    - The sentence representations are trained based on search engine log data
***
**4.  Prediction**
- How to predict words or text based on words or text
    - The example given is predicting hashtags based on a tweet/facebook post
***

**Having looked through the various papers, I feel as though exploring <font color=green>basic text classification</font> will be of the most interest due to the state-of-the-art performance that has recently been achieved using CNNs**
- Note that we will implement a model similar to the one in Yoon Kim’s Convolutional Neural Networks for Sentence Classification. The model presented in the paper achieves good classification performance across a range of text classification tasks (like Sentiment Analysis) and has since become a standard baseline for new text classification architectures
    - This is the model and architecture discussed above and illustrated in *Fig. 18.*
***

***
## DOWN TO BUSINESS ! 
## CODE AND TASK DESCRIPTION START HERE !
***

***
# <center><br>Implementing a CNN for Text Classification
***

***
## <center><br>Data and Preprocessing
***
The dataset we’ll use is the Movie Review data from Rotten Tomatoes – one of the data sets also used in the original paper
- The dataset contains 10,662 example review sentences, half positive and half negative
- The dataset has a vocabulary size of around 20,000 words

        - NOTE: Since this represents a relatively small amount of data we’re likely to overfit with a powerful model. 
***                
**<center>The following are the preprocessing steps we will need to take before we can start implementing the CNN**
***
**Step 1  :**
- Load positive and negative sentences from the raw data files
***
**Step 2  :**
- Clean the text data using the same code as the original paper
***
**Step 3  :**
- Pad each sentence to the maximum sentence length, which turns out to be 59
    - To do this we append special `<PAD>` tokens to all the sentences that are shorter than 59 words to make them 59 words long
        - This allows us to efficiently batch our data since each example in a batch must be of the same length

***
**Step 4  :**
- Build a vocabulary index and map each word to an integer between 0 and 18,765 (the vocabulary size)
    - Therefore each sentence becomes a vector of integers (word indices)
***
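
Steps 3 and 4 above can be sketched in plain Python before looking at the real code. This is only an illustration under assumptions: the helper names `pad_sentences` and `build_vocabulary` are made up for this example, the two toy sentences are not from the dataset, and the indices are simply assigned in order of first appearance.

```python
def pad_sentences(sentences, padding_token="<PAD>"):
    """Append padding tokens so every sentence is as long as the longest one."""
    max_length = max(len(sentence) for sentence in sentences)
    return [
        sentence + [padding_token] * (max_length - len(sentence))
        for sentence in sentences
    ]


def build_vocabulary(padded_sentences):
    """Map each distinct word to an integer, in order of first appearance."""
    vocabulary = {}
    for sentence in padded_sentences:
        for word in sentence:
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


# Toy data (not taken from the Movie Review dataset)
sentences = [
    "a truly great movie".split(),
    "not great".split(),
]

padded = pad_sentences(sentences)
vocabulary = build_vocabulary(padded)
encoded = [[vocabulary[word] for word in sentence] for sentence in padded]

print(padded[1])  # ['not', 'great', '<PAD>', '<PAD>']
print(encoded)    # [[0, 1, 2, 3], [4, 2, 5, 5]]
```

After this step every sentence has the same length, so a batch of sentences can be stored as a single integer matrix.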

***
### <center><br>Step 0 : Importing the Necessary Libraries
***

In [1]:
import numpy as np
import re

***
### <center><br>Step 1 & 2 : Define Functions to Load and Clean Positive and Negative Sentences From Raw Data Files
***

In [64]:
def clean_str(string):

    """
    DESCRIPTION :   Tokenization/string cleaning for all datasets except for SST.
    ARGUMENTS   :   A String (Sentence Usually)
    RETURNS     :   Tokenized string
    """
    
    # re.sub() -- https://docs.python.org/3/library/re.html
    # re.sub() -- [] is used to indicate sets or individual requirements to remove
    # re.sub() -- starting with ^ tells the expression to look for the complement (things not listed within)
    # re.sub() -- a string prefixed with r (a raw string) keeps its backslashes literally, so they reach the regex engine untouched
    
    # This re.sub will remove all characters not found within the square brackets by replacing them with a space
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)

    # These re.subs will space any contractions so they are separate from the root word 
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)    
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    
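    # These re.subs will space punctuation so that it is not connected to the neighbouring words
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    # NOTE: these replacements keep the backslash in the output (e.g. "(" becomes " \( ")
    #       Raw strings give the same result as before while avoiding invalid-escape warnings
    string = re.sub(r"\(", r" \( ", string)
    string = re.sub(r"\)", r" \) ", string)
    string = re.sub(r"\?", r" \? ", string)
    
    # This re.sub will collapse any run of two or more whitespace characters into a single space
    string = re.sub(r"\s{2,}", " ", string)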
    
    # .strip() method removes leading and trailing spaces when not used with any argument
    string = string.strip()
    
    # .lower() method changes all letters to lowercase
    string = string.lower()
    
    return string


def load_data_and_labels(positive_data_file, negative_data_file):
    
    """
    DESCRIPTION :   Loads MR polarity data from files, cleans each sentence and generates labels.
    ARGUMENTS   :   Paths to the raw positive and negative data files (in that order)
    RETURNS     :   Cleaned sentences and one-hot labels
    """
    
    # ----------------------------------------------------------------------------    
    #                      Load Positive Data From Files
    # ----------------------------------------------------------------------------
    
    # open() - Returns a file object, and is most commonly used with two arguments: open(filename, mode)
    #          -- The first argument is a string containing the filename [filename]
    #          -- The second argument is a string containing characters that describe how the file will be used [mode]
    #             --- Mode can be 'r'  when the file will only be read (This is the default argument)
    #             --- Mode can be 'w'  when the file will be written to (existing file with the same name will be erased)
    #             --- Mode can be 'a'  when the file will be appended (indexed to the end where text will be added)
    #             --- Mode can be 'r+' when the file will be read and written to
    #
    positive_examples_file = open(positive_data_file, "r", encoding='utf-8')
    
    # .readlines() - The method readlines() reads until EOF using readline() and returns a list containing the lines
    #                -- Note that each item in the list is the next sequential line in the file
    #
    positive_examples_line_by_line = positive_examples_file.readlines()

    # .close() - Releases the file handle now that every line is held in memory
    positive_examples_file.close()

    # .strip() - A method to remove characters from both ends of a string
    #            -- if no argument is passed then the method removes leading and trailing whitespace
    #            -- if an argument is passed, those characters are removed from the ends only (never from the middle)
    #               --- i.e. if the sentence is "I see cats" then sentence.strip("s") outputs "I see cat"
    #               --- for our case if the sentence is "  I see cats " then sentence.strip() outputs "I see cats"
    #
    positive_examples = [s.strip() for s in positive_examples_line_by_line]
 
    # ----------------------------------------------------------------------------
    
    
    # ----------------------------------------------------------------------------
    #                      Load Negative Data From Files
    # ----------------------------------------------------------------------------
    
    # Follows the exact same procedure as detailed in more depth above (for positive data)
    negative_examples_file = open(negative_data_file, "r", encoding='utf-8')
    negative_examples_line_by_line = negative_examples_file.readlines()
    negative_examples_file.close()
    negative_examples = [s.strip() for s in negative_examples_line_by_line]
    
    # ----------------------------------------------------------------------------

    
    # ----------------------------------------------------------------------------
    #                              Split by Words
    # ----------------------------------------------------------------------------

    # Combine the corpuses into one large corpus containing all of the positive and negative examples
    x_text = positive_examples + negative_examples
    
    # Clean all of the strings using the function clean_str defined above
    x_text = [clean_str(sentence) for sentence in x_text]

    # ----------------------------------------------------------------------------

    
    # ----------------------------------------------------------------------------
    #                              Generate labels
    # ----------------------------------------------------------------------------

    # One-hot-encode all of the examples to either be positive or negative
    #   -- positive = [0, 1]
    #   -- negative = [1, 0]
    positive_labels = [[0, 1] for _ in positive_examples]
    negative_labels = [[1, 0] for _ in negative_examples]
    
    # ----------------------------------------------------------------------------

    
    # ----------------------------------------------------------------------------
    #                     Concatenate all Labels Together
    # ----------------------------------------------------------------------------

    y = np.concatenate([positive_labels, negative_labels], 0)
    
    # ----------------------------------------------------------------------------

    # ----------------------------------------------------------------------------    
    # Return the cleaned sentences and their labels
    # ----------------------------------------------------------------------------

    return [x_text, y]

    # ----------------------------------------------------------------------------
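
To see the two functions above working together, here is a condensed, self-contained sketch that runs the same sequence of steps on two throwaway files. The names `CLEANING_RULES`, `clean_sketch` and `read_lines_sketch` are hypothetical, the sample sentences are made up, and the labels are kept as plain Python lists rather than a NumPy array so that the sketch needs nothing beyond the standard library. It also reads the files inside a `with` block, which closes them automatically.

```python
import os
import re
import tempfile

# Condensed restatement of the cleaning rules as (pattern, replacement) pairs
CLEANING_RULES = [
    (r"[^A-Za-z0-9(),!?'`]", " "),            # drop characters outside the allowed set
    (r"'s", " 's"), (r"'ve", " 've"), (r"n't", " n't"),
    (r"'re", " 're"), (r"'d", " 'd"), (r"'ll", " 'll"),
    (r",", " , "), (r"!", " ! "),
    (r"\(", r" \( "), (r"\)", r" \) "), (r"\?", r" \? "),  # backslash is kept in the output
    (r"\s{2,}", " "),                          # collapse repeated whitespace
]

def clean_sketch(text):
    """Apply every rule in order, then trim and lowercase."""
    for pattern, replacement in CLEANING_RULES:
        text = re.sub(pattern, replacement, text)
    return text.strip().lower()

def read_lines_sketch(path):
    """Read a file and strip surrounding whitespace from each line."""
    # The with-statement closes the file even if reading fails
    with open(path, "r", encoding="utf-8") as handle:
        return [line.strip() for line in handle]

with tempfile.TemporaryDirectory() as tmp:
    pos_path = os.path.join(tmp, "pos.txt")
    neg_path = os.path.join(tmp, "neg.txt")
    with open(pos_path, "w", encoding="utf-8") as f:
        f.write("It's great, isn't it? (Yes!)\n")
    with open(neg_path, "w", encoding="utf-8") as f:
        f.write("  Dull, and far too long \n")
    positive = read_lines_sketch(pos_path)
    negative = read_lines_sketch(neg_path)

# Positive sentences first, then negative, with matching one-hot labels
texts = [clean_sketch(sentence) for sentence in positive + negative]
labels = [[0, 1] for _ in positive] + [[1, 0] for _ in negative]

for text, label in zip(texts, labels):
    print(label, text)
```

The first cleaned sentence comes out as `it 's great , is n't it \? \( yes ! \)`. The contractions and punctuation are split into separate tokens, and the backslashes in front of `?`, `(` and `)` are the same ones that appear (escaped as `\\?`) in the notebook output.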

***
### <center><br>Step 3 : Pad Each Sentence to the Maximum Sentence Length
***

In [61]:
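# NOTE: the negative file is passed as positive_data_file here, so its sentences come first and would receive the positive label
_ = load_data_and_labels("./data/neg/cv000_29416.txt", "./data/pos/cv000_29590.txt")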

['plot two teen couples go to a church party , drink and then drive', 'they get into an accident', 'one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares', "what 's the deal \\?", 'watch the movie and sorta find out', 'critique a mind fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package', "which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such \\( lost highway memento \\) , but there are good and bad ways of making all types of films , and these folks just did n't snag this one correctly", 'they seem to have taken this pretty neat concept , but executed it terribly', 'so what are the problems with the movie \\?', "well , its main problem is that it 's simply too jumbled", "it starts off normal but then downshifts into this fantasy world in which you , as an audience member , have no idea wh