# Regular Part
In the first section of the regular-level exercises, we will explore how to use our saved model to make predictions on unseen data.

In [54]:
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
import pandas as pd
import numpy as np
import pickle

### Exercise 1
Refer to the Keras documentation again to see how to load your saved model, [here](https://www.tensorflow.org/guide/keras/save_and_serialize).

In [101]:
model = load_model('tagger-model.h5')

In [56]:
corpus = [
    'The articles in this special section focus on using natural language generation techniques (NLG) and natural language processing (NLP) to build computational systems that generate reports and other kinds of text in human languages.',
    'NLG uses analytics, AI, and NLP to obtain relevant information about non-linguistic data and to generate textual summaries and explanations of these data which help people understand and benefit from them.',
    'In this regard, NLG is a research field that addresses the data-value chain by using natural language as a tool for bridging the gap between raw data and valuable information communicated to users in a comprehensible way, adapted to their information needs.'
]

a. We want to tag the three sentences in the corpus above ([source](https://ieeexplore.ieee.org/document/7983468)).

Figure out the preprocessing steps that must come before passing the input to your **h5** model.


In [57]:
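# Sequence length the model expects; assumed to match the value used at training time.
max_len = 271

# Load the pickled tokenizer (assumed to be the one fitted during training).
with open('tagger-tokenizer.pkl', 'rb') as tokenizer_file:
    tokenizer = pickle.load(tokenizer_file)

# Convert each sentence to word ids, then left-pad (or left-truncate) to max_len.
corpus_encoded = tokenizer.texts_to_sequences(corpus)
corpus_padded = pad_sequences(corpus_encoded, maxlen=max_len, padding='pre', truncating='pre')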
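
To make the effect of `padding='pre'` and `truncating='pre'` concrete, here is a minimal pure-Python sketch of the same behaviour on toy sequences. The helper `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length.

    Hypothetical helper that mimics pad_sequences(..., padding='pre',
    truncating='pre') for illustration only; assumes maxlen >= 1.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncating='pre': keep the last maxlen ids
        # padding='pre': the fill values go in front of the real ids
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


toy = pad_pre([[5, 8, 2], [7], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy)  # [[0, 5, 8, 2], [0, 0, 0, 7], [3, 4, 5, 6]]
```

Because the padding sits on the left, the real tokens always occupy the last positions of each row.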

b. Use the model to make predictions on this dataset.

In [60]:
output = model.predict(corpus_padded)
output



array([[[9.14407551e-01, 1.98224038e-02, 4.11222689e-03, ...,
         3.30577535e-03, 1.72776345e-04, 1.85140863e-03],
        [9.91388500e-01, 1.57263805e-03, 2.42202150e-04, ...,
         3.61038168e-04, 1.05591180e-05, 1.68475599e-04],
        [9.96884167e-01, 4.93137981e-04, 6.65069310e-05, ...,
         1.30541259e-04, 2.94757751e-06, 5.61589113e-05],
        ...,
        [3.72465365e-02, 2.85448611e-01, 1.15548983e-01, ...,
         2.32355669e-02, 4.46862401e-03, 1.95980594e-02],
        [3.72465365e-02, 2.85448611e-01, 1.15548983e-01, ...,
         2.32355669e-02, 4.46862401e-03, 1.95980594e-02],
        [3.72465365e-02, 2.85448611e-01, 1.15548983e-01, ...,
         2.32355669e-02, 4.46862401e-03, 1.95980594e-02]],

       [[9.14407551e-01, 1.98224038e-02, 4.11222689e-03, ...,
         3.30577535e-03, 1.72776345e-04, 1.85140863e-03],
        [9.91388500e-01, 1.57263805e-03, 2.42202150e-04, ...,
         3.61038168e-04, 1.05591180e-05, 1.68475599e-04],
        [9.96884167e-01, 
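
The printed array appears to hold one probability vector per position of each padded sentence. A minimal sketch of how such an array collapses to one tag id per position, using a small made-up array in place of the real predictions (the values and the three-class size are assumptions for illustration):

```python
import numpy as np

# Made-up predictions: 1 sentence, 4 positions, 3 tag classes (illustrative values only).
toy_output = np.array([[
    [0.90, 0.05, 0.05],  # e.g. a padding position
    [0.10, 0.70, 0.20],
    [0.20, 0.20, 0.60],
    [0.05, 0.90, 0.05],
]])

# argmax over the last axis picks the most probable class at every position.
toy_tag_ids = toy_output.argmax(axis=-1)
print(toy_tag_ids.shape)  # (1, 4)
print(toy_tag_ids)        # [[0 1 2 1]]
```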

c. Tag the corpus above.

In [98]:
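tagged_corpus = []
for word_indexes, tag_probs in zip(corpus_encoded, output):
    # Map word ids back to words; keep only the last max_len, as truncating='pre' does.
    words = [tokenizer.index_word[idx] for idx in word_indexes][-max_len:]
    # Most probable tag per position. This reuses the word tokenizer with a +1 offset;
    # if the tags were encoded separately at training time, decode with that mapping instead.
    tags = [tokenizer.index_word[np.argmax(p) + 1] for p in tag_probs]
    # Drop the predictions for the left-hand ('pre') padding positions.
    tags = tags[max(0, max_len - len(words)):]
    tagged_corpus.append([words, tags])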

d. Save the corpus and the predictions to a file, and upload it with your submission on Canvas.

In [99]:
df = pd.DataFrame(tagged_corpus, columns=['words', 'tags'])
df.to_csv('tagged-corpus.csv', index=False)
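
Note that the cell above writes each sentence's word list and tag list as their string representations, one sentence per row. If a one-row-per-token layout is easier to inspect, a sketch with the standard `csv` module might look like the following. The toy `tagged` data, the tag names, and the file name `tagged-corpus-long.csv` are assumptions for illustration.

```python
import csv

# Toy stand-in for tagged_corpus: [words, tags] per sentence (tags are made up).
tagged = [
    [["nlg", "uses", "analytics"], ["NOUN", "VERB", "NOUN"]],
    [["in", "this", "regard"], ["ADP", "DET", "NOUN"]],
]

# Flatten to one (sentence_id, word, tag) row per token.
rows = [
    (sentence_id, word, tag)
    for sentence_id, (words, tags) in enumerate(tagged)
    for word, tag in zip(words, tags)  # zip stops at the shorter list
]

with open("tagged-corpus-long.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence_id", "word", "tag"])
    writer.writerows(rows)
```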

## Advanced part

Use a scraping library. [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) should work, but feel free to choose any library you wish.

Scrape some publicly available text (public speeches, lyrics, poetry, etc.), and build a corpus of your own.
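
Beautiful Soup is the suggested library. As a dependency-free alternative, the sketch below uses the standard library's `html.parser` on an inline HTML string rather than a live page. The class name `ParagraphText` and the sample markup are made up for illustration; a real page would first need to be downloaded, and its structure will differ.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text nested in inline tags such as <b> is appended to the open paragraph.
        if self.in_paragraph:
            self.paragraphs[-1] += data


# Sample markup standing in for a downloaded page.
html = "<html><body><h1>Title</h1><p>First <b>line</b> of text.</p><p>Second line.</p></body></html>"
parser = ParagraphText()
parser.feed(html)
parser.close()
scraped_corpus = [p.strip() for p in parser.paragraphs if p.strip()]
print(scraped_corpus)  # ['First line of text.', 'Second line.']
```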

Use a one-to-many RNN (any RNN you want: vanilla, LSTM, GRU, a bidirectional version, etc.) to perform some natural language generation resembling the corpus your model was trained on.

You don't have to train the model for hours, just show that your model works.
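
Whichever RNN is chosen, the surrounding data preparation and sampling logic tend to look similar. The sketch below assumes a character-level, next-character setup: it builds training windows from a toy text and draws the next character from a probability vector with a temperature parameter. The text, the window length, and the helper names `make_windows` and `sample_index` are assumptions, and the uniform probabilities merely stand in for a trained model's softmax output.

```python
import numpy as np

text = "to be or not to be"
chars = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(chars)}


def make_windows(text, window):
    """Return (input ids, next-char id) pairs for next-character prediction."""
    ids = [char_to_id[c] for c in text]
    xs = [ids[i:i + window] for i in range(len(ids) - window)]
    ys = [ids[i + window] for i in range(len(ids) - window)]
    return np.array(xs), np.array(ys)


def sample_index(probs, temperature=1.0, rng=None):
    """Sample an index after rescaling probabilities by a temperature."""
    rng = rng or np.random.default_rng()
    logits = np.log(np.asarray(probs, dtype=float) + 1e-9) / temperature
    scaled = np.exp(logits - logits.max())  # subtract the max for numerical stability
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


X, y = make_windows(text, window=5)
print(X.shape, y.shape)  # (13, 5) (13,)

# Placeholder for model output: every character equally likely.
uniform = np.full(len(chars), 1.0 / len(chars))
next_char = chars[sample_index(uniform, temperature=0.8, rng=np.random.default_rng(0))]
```

Lower temperatures make the sampling more conservative, while higher ones produce more varied text; generation repeats the sampling step, feeding each new character back into the input window.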


In [1]:
# Add your code here