# Benjamin Wilke
## Natural Language Processing - Midterm

### Short Essay 1

<b><i>Select one career or industry that makes use of applied NLP.

Explain generally how that field or career utilizes NLP.

Explain at least some methods of NLP that are very likely to be used in the career or industry you selected.

Give at least one specific example of a use case for NLP within the chosen field, and explain how the problem or situation is (or could be) improved by applying NLP.
</i></b>


I can speak about how NLP is used in my own industry: digital advertising. In the beginning of digital advertising, most advertisements were served based on keywords manually appended to content, for example an advertisement for Kraft foods on a recipe site. Today most targeted digital advertisements are enabled by cookies and mobile device identifiers that capture behavioral information across the open web, social platforms, and mobile apps. The example here is seeing an advertisement for a sweatshirt you looked at on Banana Republic while browsing ESPN (the sweatshirt has nothing to do with sports). However, the privacy of Internet users has recently come under scrutiny, and Google announced the deprecation of the third-party cookie in its Chrome browser, following Firefox and Safari. These concerns have brought renewed interest in targeting users based on content rather than behavior: contextual targeting. While not unlike the simple keyword targeting of the golden age of digital advertising, these contextual targeting platforms are able to apply much more information in an automated fashion by utilizing NLP.

The use cases for contextual targeting are two-fold. First, it can be used to ensure that your brand’s advertisements are running in a brand-safe environment. Keywords appended manually by content authors, or an analysis of the URL alone, are not good indicators of the actual content of an article. Most advertisers try to avoid running advertisements alongside content falling into what are known as the “dirty dozen” toxic content categories, which include pornography, crime, obscenity, terrorism, hate speech, and others. The second use case is to target users alongside relevant content, including late-breaking news and trending themes that may otherwise not be found in targetable taxonomies.

Given the long sequence length of most Internet articles, it’s likely that this technology utilizes simple bag-of-words methods for content classification (versus a method like a recurrent neural network). This can be achieved by using TF-IDF in a supervised manner, comparing new content to known, already-labeled content. To enable quick targeting of late-breaking news, articles can be identified based on new words and word combinations, while the new content is also scored for sentiment or the presence of non-brand-safe content.
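As a rough illustration of the bag-of-words idea described above, here is a minimal sketch of TF-IDF classification by similarity to labeled reference content. It is not Grapeshot's method (which I have not seen in code); the category names, the toy documents, and the helper functions `build_idf`, `tfidf`, and `classify` are all hypothetical, and the IDF smoothing is just one common choice.

```python
import math
from collections import Counter

# Hypothetical labeled reference documents (toy data, not from any real platform).
labeled_docs = {
    "sports": "the team won the game with a late goal",
    "cooking": "simmer the sauce and season the pasta with fresh basil",
    "crime": "police arrested a suspect after the robbery",
}

def tokenize(text):
    return text.lower().split()

def build_idf(token_lists):
    """Smoothed inverse document frequency over the reference documents."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; tokens unseen in the reference set are dropped."""
    counts = Counter(tok for tok in tokens if tok in idf)
    return {tok: count * idf[tok] for tok, count in counts.items()}

def cosine_similarity(a, b):
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(text, labeled_docs):
    """Return the label of the most similar reference document, plus all scores."""
    labels = list(labeled_docs)
    token_lists = [tokenize(labeled_docs[label]) for label in labels]
    idf = build_idf(token_lists)
    reference = {label: tfidf(toks, idf) for label, toks in zip(labels, token_lists)}
    query = tfidf(tokenize(text), idf)
    scores = {label: cosine_similarity(query, vec) for label, vec in reference.items()}
    return max(scores, key=scores.get), scores

label, scores = classify("police chase the robbery suspect", labeled_docs)
print(label, scores)
```

Word order is ignored entirely, which is why this kind of approach scales to long articles; a brand-safety check could, under the same assumptions, treat a high score against a "toxic" reference category as a reason to block the placement.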

I work for Oracle Data Cloud and we recently acquired Grapeshot, which is a contextual targeting platform that was cofounded by Dr. Martin Porter (who invented the Porter stemmer).

https://www.oracle.com/a/ocom/docs/corporate/acquisitions/grapeshot-methodology.pdf

### Short Essay 2

<b><i>Choose one of the “trade-offs” in NLP that was covered in the asynchronous materials for this course.

Explain the trade-off in general terms. Define the two choices.

Explain the benefits and weaknesses of each side of the trade-off.  Include at least one benefit and one weakness of each.

Describe a work-situation that would make one of the choices in the trade-off much better, in terms of practical outcomes for you and your stakeholders on a project.</i></b>

While not unique to NLP, one of the interesting “trade-offs” in NLP is transparent vs. opaque. These terms describe an algorithm, and the distinction is sometimes framed as “AI” (artificial intelligence) vs. “XAI” (explainable artificial intelligence). Often in machine learning there is a trade-off between algorithms that expose how they arrive at an outcome ("transparent") and a “black box” where the specific qualities driving an outcome aren’t known ("opaque"). Black box methods tend to be more powerful, so the choice is not as easy as simply opting for the more transparent approach/algorithm.

The benefit of a transparent algorithm is that its results are often readily interpretable: for example, it is possible to understand specifically which features contribute to an outcome (and by how much). This is often helpful when a business case requires understanding which aspects of the data should be explored further to optimize a result. Unfortunately, a weakness of transparent algorithms is that they may not provide the most predictive power, as statistical subtleties and interactions between features are often excluded from the models (because they cannot easily be explained).

Opaque algorithms often excel at pure predictive power and accuracy and are mostly dominated by the deep learning approaches that have taken the data science community by storm. Business cases where these algorithms are relevant concern improving the accuracy of predictions at the expense of interpretability. That is to say – the model may have fantastic predictive power, but drilling into one of the potentially millions of parameters in a deep neural network has no practical interpretation.

An example of the application of a transparent algorithm is estimating home prices using regression. Perhaps the business stakeholders are willing to have less accurate forecasting if they can understand the top 3 contributors to the home price. This information can be used to inform marketing messaging or sales prospecting.

Finally, an example of applying an opaque “black box” algorithm would be using a recurrent neural network to classify sentiment in movie reviews (like we’ve done in this class). The business stakeholders likely don’t care about which specific words or interactions between words carry the most predictive power; they care about maximizing the accuracy of the model.


### NLP Networks

<b><i>Label each block and step by input/sequence step.  Compute the dimensions of the weight for all steps.  All inputs must be labeled by dimension.  Include your original word ENCODING (notice not vector!) as input.  You may omit bias!</i></b>

In [81]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import GRU, Embedding, Dense

max_sent_length = 8

#X_train = sequence.pad_sequences(X_train, maxlen=max_sent_length)
#X_test = sequence.pad_sequences(X_test, maxlen=max_sent_length)

embedding_vecor_length = 80

vocab = {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'jumped': 4, 'over': 5, 'fence': 6, 'under': 7, 'car' : 8, 'did': 9}

top_words = len(vocab) #<-- had to get the length of vocabulary as input to determine the shape of embedding layer

model = Sequential()
model.add(Embedding(top_words, embedding_vecor_length, input_length=max_sent_length))
model.add(GRU(115, return_sequences=True))
model.add(GRU(95))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 8, 80)             800       
_________________________________________________________________
gru_15 (GRU)                 (None, 8, 115)            67965     
_________________________________________________________________
gru_16 (GRU)                 (None, 95)                60420     
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 96        
Total params: 129,281
Trainable params: 129,281
Non-trainable params: 0
_________________________________________________________________
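The parameter counts in the summary can be checked by hand. The sketch below is plain Python (no TensorFlow needed) and assumes the TensorFlow 2 default `reset_after=True` for `GRU`, which keeps two bias vectors per gate; the helper names are my own. Biases are included here only so the totals match `model.summary()`.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding_dim-long row per vocabulary index: weight shape (vocab_size, embedding_dim).
    return vocab_size * embedding_dim

def gru_params(input_dim, units, reset_after=True):
    # Three weight sets: update gate, reset gate, candidate state.
    kernel = input_dim * units       # W: input -> hidden, shape (input_dim, units) per set
    recurrent = units * units        # U: hidden -> hidden, shape (units, units) per set
    bias = 2 * units if reset_after else units
    return 3 * (kernel + recurrent + bias)

def dense_params(input_dim, units):
    return input_dim * units + units  # weights plus bias

counts = {
    "embedding": embedding_params(10, 80),   # vocab of 10 words, 80-dim vectors
    "gru_1": gru_params(80, 115),            # input is the 80-dim embedding
    "gru_2": gru_params(115, 95),            # input is the 115-dim output of the first GRU
    "dense": dense_params(95, 1),
}
total = sum(counts.values())
print(counts, total)
```

Under these assumptions the per-layer numbers come out to 800, 67,965, 60,420, and 96, matching the summary above.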


![](GRUNetworkDiagram.png)

<b><i>Write the initial vector form of the input sequence using only 1s and 0s</i></b>

In [67]:
vocab = {'<<PAD>>': 0, 'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumped': 5, 'over': 6, 'fence': 7, 'under': 8, 'car' : 9, 'did': 10}
vocab

{'<<PAD>>': 0,
 'the': 1,
 'quick': 2,
 'brown': 3,
 'fox': 4,
 'jumped': 5,
 'over': 6,
 'fence': 7,
 'under': 8,
 'car': 9,
 'did': 10}

I added PAD at index 0, which increases the vocabulary size to 11. This means the one-hot encoded sequence will need to be 8 x 11, which is max_sent_length x vocab size.

In [82]:
import numpy as np

# One row per time step (max_sent_length = 8), one column per vocabulary index (11).
onehot_vec = np.array([
    [1,0,0,0,0,0,0,0,0,0,0],  # <<PAD>>
    [1,0,0,0,0,0,0,0,0,0,0],  # <<PAD>>
    [0,1,0,0,0,0,0,0,0,0,0],  # the
    [0,0,0,0,0,0,0,0,0,1,0],  # car
    [0,0,0,0,0,1,0,0,0,0,0],  # jumped
    [0,0,0,0,0,0,1,0,0,0,0],  # over
    [0,1,0,0,0,0,0,0,0,0,0],  # the
    [0,0,0,0,0,0,0,1,0,0,0],  # fence
])

onehot_vec

array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]])

In [75]:
onehot_vec.shape

(8, 11)
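The same matrix can be derived from the integer encoding rather than typed by hand. This is a minimal sketch in plain Python lists; it assumes left ("pre") padding, which is what I used above, and the variable names are my own.

```python
vocab = {'<<PAD>>': 0, 'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumped': 5,
         'over': 6, 'fence': 7, 'under': 8, 'car': 9, 'did': 10}
max_sent_length = 8

sentence = "the car jumped over the fence"

# Integer encoding: one vocabulary index per word.
encoded = [vocab[word] for word in sentence.split()]

# Left-pad with the <<PAD>> index up to max_sent_length.
padded = [vocab['<<PAD>>']] * (max_sent_length - len(encoded)) + encoded

# One-hot: a row of zeros with a single 1 at the word's index.
onehot = [[1 if col == idx else 0 for col in range(len(vocab))] for idx in padded]

print(padded)
for row in onehot:
    print(row)
```

The intermediate `padded` list is the word ENCODING the network actually receives; the one-hot rows are only its expanded vector form.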

<b><i>Find the average Glove Word Vector of your input sequence - "the car jumped over the fence" (Spacy uses Glove vectors!)!</i></b>

In [None]:
import spacy
nlp = spacy.load("en_core_web_md")

In [145]:
average_seq = (nlp("the").vector + nlp("car").vector + nlp("jumped").vector + nlp("over").vector + nlp("the").vector + nlp("fence").vector) / 6
average_seq

array([ 1.02246672e-01,  6.65950105e-02, -1.49527833e-01,  3.49901654e-02,
        2.13985667e-01, -7.65089318e-02, -2.45101675e-01,  7.83283338e-02,
        2.09699962e-02,  2.41605020e+00, -1.05761170e-01,  5.89891970e-02,
        3.98833267e-02, -6.72252774e-02, -2.43400812e-01, -1.21773286e-02,
       -7.25516751e-02,  1.03658164e+00, -2.32895855e-02, -1.55521646e-01,
        2.29424983e-02, -8.47426429e-02,  2.09339336e-01,  1.32689821e-02,
        1.60996635e-02, -1.53346330e-01, -8.59160051e-02, -1.78890824e-01,
       -9.34920013e-02, -2.61886623e-02,  4.54970933e-02,  2.78918356e-01,
       -1.12586327e-01,  2.80501485e-01,  1.23955004e-01, -1.41709670e-01,
       -4.83516753e-02,  1.34987980e-02,  1.60396993e-02, -3.77281681e-02,
        2.11798310e-01,  4.15756665e-02, -5.41546196e-02, -6.71449974e-02,
        8.97116885e-02, -4.67299968e-02, -2.15548679e-01, -3.80335063e-01,
        6.24183305e-02,  1.80273965e-01, -4.30923253e-02, -6.58916831e-02,
       -1.00524008e-01,  

<b><i>Find the nearest word (in the above dictionary) to the average calculated in previous question</i></b>

In [165]:
from scipy.spatial import distance

# The Cosine Similarity values for different documents, 1 (same direction), 0 (90 deg.), -1 (opposite directions).
# Cosine Distance, 1 - Cosine Similarity, then can take on values from 0 to 2 

vocab = {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'jumped': 4, 'over': 5, 'fence': 6, 'under': 7, 'car' : 8, 'did': 9}

list(zip(vocab, [distance.cosine(average_seq, nlp(word).vector) for word, idx in vocab.items()]))

[('the', 0.24667346477508545),
 ('quick', 0.532218724489212),
 ('brown', 0.716812789440155),
 ('fox', 0.689554899930954),
 ('jumped', 0.3494679927825928),
 ('over', 0.25146275758743286),
 ('fence', 0.3551366925239563),
 ('under', 0.508454829454422),
 ('car', 0.3474304676055908),
 ('did', 0.4202675223350525)]

Not surprisingly, the average GloVe vector of our input sequence is closest to the word "the". This is mostly because "the" appears in the input sequence twice, so it contributes two of the six vectors in the average. Interestingly, the next most similar word is "over", which suggests that "over" is more similar to "the" than "car" or "jumped" are. The pairwise distances below support this.

In [174]:
distance.cosine(nlp("the").vector, nlp("over").vector)

0.45474982261657715

In [175]:
distance.cosine(nlp("the").vector, nlp("car").vector)

0.6812600791454315

In [176]:
distance.cosine(nlp("the").vector, nlp("jumped").vector)

0.7429614961147308
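To see why a repeated word pulls the average toward itself, here is a small sketch with made-up 3-dimensional vectors (they are not GloVe vectors, and the helper names are hypothetical). `cosine_distance` follows the same definition as `scipy.spatial.distance.cosine`, i.e. 1 minus the cosine similarity.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nearest_word(vector, embeddings):
    return min(embeddings, key=lambda word: cosine_distance(vector, embeddings[word]))

# Hypothetical, mutually orthogonal toy embeddings.
toy = {"the": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0], "fence": [0.0, 0.0, 1.0]}

avg_once = mean_vector([toy[w] for w in ["the", "car", "fence"]])
avg_repeat = mean_vector([toy[w] for w in ["the", "car", "the", "fence"]])

dist_once = {w: cosine_distance(avg_once, v) for w, v in toy.items()}
dist_repeat = {w: cosine_distance(avg_repeat, v) for w, v in toy.items()}
print(dist_once)
print(dist_repeat, nearest_word(avg_repeat, toy))
```

With each word appearing once the average is equally far from all three words; repeating "the" moves the average closer to "the" and further from the others.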

<b><i>What is the difference between the W(weight) matrix of the first GRU sequence at time/sequence 0 and at time/sequence 5.  How do you know this?</i></b>

There is no difference in the weight matrix itself: the same weights are shared across every time step of the sequence, and they cannot change between time/sequence 0 and time/sequence 5 because backpropagation has not yet occurred. Only once the final GRU output is produced can gradients be calculated from the loss, and the weights are then updated before the next batch of sequences is processed.

However, the hidden state of the GRU recurrent layer does differ: by time/sequence 5 it has been informed by each word (represented by its word embedding) from time/sequence 0-4. The hidden state carried over from time/sequence 4 is fed into the GRU together with the word embedding for time/sequence 5 to produce the next hidden state, which is passed to time/sequence 6, and so on until the total sequence length (in our example, 8) is reached.
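The distinction between shared weights and a changing hidden state can be shown with a toy forward pass. This is a sketch with scalar inputs and a single hidden unit, hypothetical weight values, and no bias; it uses the textbook GRU equations with the Keras convention `h = z * h_prev + (1 - z) * h_candidate`, not the exact Keras implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical scalar weights (input size 1, hidden size 1); bias omitted.
weights = {"W_z": 0.5, "U_z": -0.3, "W_r": 0.8, "U_r": 0.1, "W_h": 1.2, "U_h": -0.7}

def gru_step(x, h_prev, w):
    z = sigmoid(w["W_z"] * x + w["U_z"] * h_prev)                # update gate
    r = sigmoid(w["W_r"] * x + w["U_r"] * h_prev)                # reset gate
    h_cand = math.tanh(w["W_h"] * x + w["U_h"] * (r * h_prev))   # candidate state
    return z * h_prev + (1.0 - z) * h_cand

inputs = [0.0, 0.0, 0.1, 0.9, 0.5, 0.6, 0.1, 0.7]  # one toy scalar per time step
weights_before = dict(weights)

h = 0.0
hidden_states = []
for x in inputs:
    h = gru_step(x, h, weights)   # the same weights dict is reused at every step
    hidden_states.append(h)

print(weights == weights_before)            # weights untouched by the forward pass
print(hidden_states[0], hidden_states[5])   # hidden state differs between steps
```

Nothing in the loop writes to `weights`; only `h` changes from step to step, which is the point made above.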

<b><i>What is missing in the above code—something important is not determined and based on that, there are some minor adjustments or additions that need to be made
    
Make a logical determination of what that missing piece of info should be based on the info given here and what additions or adjustments are necessary </i></b>

I think this is referring to the vocabulary size and/or the addition of padding and unknown tokens/indices in the dictionary, which are some of the things I had to consider to answer the questions in this midterm.

I first noticed this when top_words wasn't defined in the professor's code. This is a required input to the embedding layer, as the shape of the embedding weights is (vocabulary size x embedding length). "top_words" also implies that the code the professor was using was likely analyzing the most used words in the corpus to determine which words to keep in the dictionary. Words that aren't in the dictionary get mapped to an "unknown" token, which is generally done to speed up processing and/or reduce the size of the total vocabulary in memory.

Assuming the professor wanted us to maintain the sequence length of 8 for question 2, I had to introduce a padding token and index to extend the six-word sentence to length 8. With the padding token the vocabulary grows from 10 to 11 entries, so top_words would need to be 11 and the embedding layer would hold 11 x 80 = 880 weights instead of 800. It's always a best practice to include the padding and unknown tokens/indices from the beginning of dictionary creation, as they are almost always needed for sequence learning natural language models.
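A minimal sketch of what reserving those tokens up front could look like is below. The corpus, the `<<UNK>>` token name, and the helpers `build_vocab` and `encode` are hypothetical; the padding and truncation are done on the left, which I believe matches the defaults of Keras `pad_sequences`.

```python
from collections import Counter

PAD, UNK = "<<PAD>>", "<<UNK>>"

def build_vocab(corpus, top_words=None):
    """Reserve PAD and UNK first, then add the top_words most frequent tokens."""
    counts = Counter(tok for sentence in corpus for tok in sentence.lower().split())
    vocab = {PAD: 0, UNK: 1}
    for word, _ in counts.most_common(top_words):
        vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab, max_len):
    """Map tokens to indices (unknown words -> UNK), then left-truncate and left-pad."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in sentence.lower().split()]
    ids = ids[-max_len:]
    return [vocab[PAD]] * (max_len - len(ids)) + ids

# Hypothetical toy corpus.
corpus = [
    "the quick brown fox jumped over the fence",
    "did the fox see the car under the fence",
]

vocab = build_vocab(corpus, top_words=10)
embedding_input_dim = len(vocab)   # the value an Embedding layer would need as its vocabulary size

known = encode("the car jumped over the fence", vocab, max_len=8)
unseen = encode("the zebra jumped over the fence", vocab, max_len=8)
print(embedding_input_dim, known, unseen)
```

Because the two special tokens are counted in `len(vocab)`, the embedding size and the encoded indices stay consistent with each other from the start.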