## Deep Learning Course (980)
## Assignment Three 

__Assignment Goals:__

- Implementing RNN based language models.
- Implementing and applying a Recurrent Neural Network on text classification problem using TensorFlow.
- Implementing __many to one__ and __many to many__ RNN sequence processing.

In this assignment, you will implement RNN-based language models and compare extracted word representation from different models. You will also compare two different training methods for sequential data: Truncated Backpropagation Through Time __(TBTT)__ and Backpropagation Through Time __(BTT)__. 
Also, you will be asked to apply Vanilla RNN to capture word representations and solve a text classification problem. 


__DataSets__: You will use two datasets, an English Literature for language model task (part 1 to 4) and 20Newsgroups for text classification (part 5). 


1. (30 points) Implement the RNN based language model described by Mikolov et al.[1], also called __Elman network__ and train a language model on the English Literature dataset. This network contains input, hidden and output layer and is trained by standard backpropagation (TBTT with τ = 1) using the cross-entropy loss. 
   - The input represents the current word while using 1-of-N coding (thus its size is equal to the size of the vocabulary) and vector s(t − 1) that represents output values in the hidden layer from the previous time step. 
   - The hidden layer is a fully connected sigmoid layer with size 500. 
   - Softmax Output Layer to capture a valid probability distribution.
   - The model is trained with truncated backpropagation through time (TBTT) with τ = 1: the weights of the network are updated based on the error vector computed only for the current time step.
   
   Download the English Literature dataset and train the language model as described, report the model cross-entropy loss on the train set. Use nltk.word_tokenize to tokenize the documents. 
For initialization, s(0) can be set to a vector of small values. To improve performance, you can merge all words that occur less often than a threshold (here 3) into a special rare token (\__unk__). Note that we are not interested in the *dynamic model* mentioned in the original paper. 
To make the implementation simpler you can use Keras to define neural net layers, including Keras.Embedding. (Keras.Embedding will create an additional mapping layer compared to the Elman architecture.) 

2. (20 points) TBTT has less computational cost and memory needs in comparison with *backpropagation through time algorithm (BTT)*. These benefits come at the cost of losing long term dependencies [2]. Now let's try to investigate computational costs and performance of learning our language model with BTT. For training the Elman-type RNN with BTT, one option is to perform mini-batch gradient descent with exactly one sentence per mini-batch. (The input  size will be [1, Sentence Length]). 

    1. Split the document into sentences (you can use nltk.tokenize.sent_tokenize).
    2. For each sentence, perform one pass that computes the mean/sum loss for this sentence; then perform a gradient update for the whole sentence. (So the mini-batch size varies for the sentences with different lengths). You can truncate long sentences to fit the data in memory. 
    3. Report the model cross-entropy loss.

3. (15 points) It does not seem that simple recurrent neural networks can capture truly exploit context information with long dependencies, because of the problem that gradients vanish and exploding. To solve this problem, gating mechanisms for recurrent neural networks were introduced. Try to learn your last model (Elman + BTT) with the SimpleRnn unit replaced with a Gated Recurrent Unit (GRU). Report the model cross-entropy loss. Compare your results in terms of cross-entropy loss with two other approach(part 1 and 2). Use each model to generate 10 synthetic sentences of 15 words each. Discuss the quality of the sentences generated - do they look like proper English? Do they match the training set?
    Text generation from a given language model can be done using the following iterative process:
   1. Set sequence = \[first_word\], chosen randomly.
   2. Select a new word based on the sequence so far, add this word to the sequence, and repeat. At each iteration, select the word with maximum probability given the sequence so far. The trained language model outputs this probability. 

4. (15 points) The text describes how to extract a word representation from a trained RNN (Chapter 4). How we can evaluate the extracted word representation for your trained RNN? Compare the words representation extracted from each of the approaches using one of the existing methods.

5. (20 points) We are aiming to learn an RNN model that predicts document categories given its content (text classification). For this task, we will use the 20Newsgroupst dataset. The 20Newsgroupst contains messages from twenty newsgroups.  We selected four major categories (comp, politics, rec, and religion) comprising around 13k documents altogether. Your model should learn word representations to support the classification task. For solving this problem modify the __Elman network__ architecture such that the last layer is a softmax layer with just 4 output neurons (one for each category). 

    1. Download the 20Newsgroups dataset, and use the implemented code from the notebook to read in the dataset.
    2. Split the data into a training set (90 percent) and validation set (10 percent). Train the model on  20Newsgroups.
    3. Report your accuracy results on the validation set.

__NOTE__: Please use Jupyter Notebook. The notebook should include the final code, results and your answers. You should submit your Notebook in (.pdf or .html) and .ipynb format. (penalty 10 points) 

__Instructions__:

The university policy on academic dishonesty and plagiarism (cheating) will be taken very seriously in this course. Everything submitted should be your own writing or coding. You must not let other students copy your work. Spelling and grammar count.

Your assignments will be marked based on correctness, originality (the implementations and ideas are from yourself), clarification and test performance.


[1] Tom´ as Mikolov, Martin Kara ˇ fiat, Luk´ ´ as Burget, Jan ˇ Cernock´ ˇ y,Sanjeev Khudanpur: Recurrent neural network based language model, In: Proc. INTERSPEECH 2010

[2] Tallec, Corentin, and Yann Ollivier. "Unbiasing truncated backpropagation through time." arXiv preprint arXiv:1705.08209 (2017).



In [5]:

"""This code is used to read all news and their labels"""
import os
import glob

def to_categories(name, cat=["politics","rec","comp","religion"]):
    for i in range(len(cat)):
        if str.find(name,cat[i])>-1:
            return(i)
    print("Unexpected folder: " + name) # print the folder name which does not include expected categories
    return("wth")

def data_loader(images_dir):
    categories = os.listdir(data_path)
    news = [] # news content
    groups = [] # category which it belong to
    
    for cat in categories:
        print("Category:"+cat)
        for the_new_path in glob.glob(data_path + '/' + cat + '/*'):
            news.append(open(the_new_path,encoding = "ISO-8859-1", mode ='r').read())
            groups.append(cat)

    return news, list(map(to_categories, groups))



data_path = "datasets/20news_subsampled"
news, groups = data_loader(data_path)

Category:talk.politics.misc
Category:talk.politics.mideast
Category:talk.religion.misc
Category:comp.windows.x
Category:soc.religion.christian
Category:rec.motorcycles
Category:rec.autos
Category:talk.politics.guns
Category:comp.graphics
Category:comp.sys.ibm.pc.hardware
Category:rec.sport.baseball
Category:comp.os.ms-windows.misc
Category:rec.sport.hockey
Category:comp.sys.mac.hardware


In [6]:
from collections import Counter
English_literature_path = './datasets/English Literature.txt'
English_literature = open(English_literature_path).read()

# print(English_literature[:100])

# UW = Counter(English_literature)
# print(UW)

In [7]:
import nltk
# nltk.download('all')

In [8]:
English_sentences = nltk.sent_tokenize(English_literature)

In [9]:
print(English_sentences[:1])

['First Citizen:\nBefore we proceed any further, hear me speak.']


# Question 1

## *unk* is used for words with frequency less than 3

## 17% accuracy achieved after 5 epoches 

In [10]:
English_words = nltk.word_tokenize(English_sentences[0])
print(English_words)

['First', 'Citizen', ':', 'Before', 'we', 'proceed', 'any', 'further', ',', 'hear', 'me', 'speak', '.']


In [11]:
English_words = nltk.word_tokenize(English_literature)
print(English_words[:5])

['First', 'Citizen', ':', 'Before', 'we']


In [12]:


unique_words =  Counter(English_words)

print(len(unique_words))

14309


In [13]:
# print(English_words[:50])

for i in range(0,len(English_words)):
    
    if unique_words[English_words[i]] <=3:
        English_words[i] = "*unk*";

# print(English_words[:50])


unique_words_new =  Counter(English_words)

# print((unique_words_new))

print(len(English_words))
print(len(unique_words_new))

254533
4243


In [14]:
import numpy as np
vocab = sorted(set(English_words))
word_2_id = {u:i for i, u in enumerate(vocab)}
id_2_word = np.array(vocab)
w2i1 = word_2_id
English_word_as_id = np.array([word_2_id[c] for c in English_words])
print(English_words[:20])
print(English_word_as_id[:2])

# print(word_2_id[English_words[0]])
print(English_word_as_id.shape)

['First', 'Citizen', ':', 'Before', 'we', 'proceed', 'any', 'further', ',', 'hear', 'me', 'speak', '.', 'All', ':', 'Speak', ',', 'speak', '.', 'First']
[286 192]
(254533,)


In [15]:

sequence = []

for i in range (0,len(English_word_as_id)-1):
    
#     input_data.append([English_word_as_id[i]])
#     output_data.append(English_word_as_id[i+1])
    
    sequence.append([[English_word_as_id[i]],[English_word_as_id[i+1]]])
    
print(np.array(sequence).shape)

(254532, 2, 1)


In [16]:
# print(input_data)
# print(sequence[:5])
import tensorflow as tf
from keras.utils import to_categorical
inp = (np.array(sequence))
input_data = inp[:,0]
output_data = inp[:,1]
# print(inp[:5])
outp = to_categorical(output_data)

# print(inp[:5])
# print(outp[:5])
print((len(input_data)))
print(len(outp[1,:]))
print(len(English_word_as_id))

254532
4243
254533


In [13]:
vocab_size = len(outp[1,:])
embedding_dim = 256
rnn_units = 500
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    
    tf.keras.layers.SimpleRNN(rnn_units,
                        activation = 'sigmoid',
                        recurrent_initializer='glorot_uniform'),
    
    tf.keras.layers.Dense(vocab_size, activation = 'softmax')
  
])


In [14]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 256)         1086208   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 500)               378500    
_________________________________________________________________
dense (Dense)                (None, 4243)              2125743   
Total params: 3,590,451
Trainable params: 3,590,451
Non-trainable params: 0
_________________________________________________________________


In [15]:
model.compile(optimizer='adam', loss='categorical_crossentropy', 
              metrics=['acc'])

In [16]:
model.fit(input_data, outp, batch_size = 500, epochs=5, shuffle=False)

Train on 254532 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f959c79a8d0>

# Some prediction with the model

The model predicts unknown words and tried to mimic the training set.

In [141]:
model.save_weights('modelQ1')

w_test = "First"
input_test = word_2_id[w_test]

# print(input_test)
print([w_test])
pre_word = model.predict_classes([input_test])

for p in range (0,5):
    print(id_2_word[pre_word])
    
    pre_word = model.predict_classes([pre_word])

['First']
['Servant']
[':']
['I']
['have']
['*unk*']


# Question 2

## *unk* is used for words with frequency less than 3

## 49.99% accuracy achieved after 5 epoches 

pad sequence was used for post padding and truncating

In [17]:
English_words_from_sentence = []
for x in English_sentences:
    
    word = nltk.word_tokenize(x)
#     print(word[1])
    
    for i in range(0,len(word)):
    
        if unique_words[word[i]] <=3:
            word[i] = "*unk*";

    English_words_from_sentence.append(word)
#     print(English_words)

# print(English_words_from_sentence)

In [18]:
import numpy as np
from keras.preprocessing.sequence import pad_sequences
# unique_words = Counter(English_words_from_sentence)
# print(unique_words[:5])
# vocab = sorted(set(English_words_from_sentence))
# word_2_id = {u:i for i, u in enumerate(vocab)}
# id_2_word = np.array(vocab)
# English_words = nltk.word_tokenize(English_literature)
vocab = sorted(set(English_words))
vocab = np.append("PAD",vocab)
word_2_id = {u:i for i, u in enumerate(vocab)}

w2i2 = word_2_id
id_2_word = np.array(vocab)

English_sentence_2_id = []
for sentence in English_words_from_sentence:
    sentence_id = []
    for word in sentence:
        
        sentence_id.append(word_2_id[word])
        
    English_sentence_2_id.append(sentence_id)
    
    
# print(English_words_from_sentence[0])    
print(English_sentence_2_id[0])


Padded_English_sentence_2_id = pad_sequences(English_sentence_2_id, maxlen = 21, padding = 'post', truncating='post')
# print(id_2_word[0])

# print(word_2_id["Resolved"])

        

# English_word_as_id = np.array([word_2_id[c] for c in English_words_from_sentence])



# id_2_word= np.append(id_2_word,"*unk*")
# word_2_id["*unk*"] = len(id_2_word)-1 
# # print(word_2_id["*unk*"])

# for i in range (0, len(English_words_from_sentence)):
    
#     if len(English_words_from_sentence[i]) >50:
        
#         English_words_from_sentence[i]  = English_words_from_sentence[i][:50]
        
#     else:
#         while len(English_words_from_sentence[i]) < 50:
#             English_words_from_sentence[i].append("*unk*")
            
    

# print(English_words_from_sentence[1])

# English_sentence_2_id = []
# for sentence in English_words_from_sentence:
#     sentence_id = []
#     for word in sentence:
        
#         sentence_id.append(word_2_id[word])
        
#     English_sentence_2_id.append(sentence_id)
    
# print(English_words_from_sentence[1])
# print(English_sentence_2_id[1])

[287, 193, 41, 127, 4054, 3065, 969, 2011, 37, 2179, 2620, 3551, 39]


In [19]:
import tensorflow as tf
from keras.utils import to_categorical
English_sentence_2_id = np.array(English_sentence_2_id)
# print(len(English_sentence_2_id[:]))
# English_sentence_2_id = np.reshape(English_sentence_2_id,(len(English_sentence_2_id),20))

# inp = (np.array(English_sentence_2_id))
input_data = Padded_English_sentence_2_id[:,:-1]
print(input_data.shape)
output_data = Padded_English_sentence_2_id[:,1:]
# print(input_data[1])
# print(inp[:5])
outp = to_categorical(output_data)
print(outp.shape)

# print(inp[:5])
# print(outp[:5])
# print((len(input_data)))
# print(len(outp[1,:]))
# print(len(English_word_as_id))

# print(English_sentence_2_id[1])
# output_data = English_sentence_2_id[:,1]
# input_data = English_sentence_2_id[:,0]
# # output_data = English_sentence_2_id[:][1:]
# print(input_data.shape)
# print(output_data.shape)
# outp = to_categorical(output_data, num_classes = len(vocab)+1)
# print(len(input_data[:,1]))
# print(len(output_data[:,1]))
        

(12460, 20)
(12460, 20, 4244)


In [20]:
vocab_size = outp.shape[2]
embedding_dim = 256
rnn_units = 500
modelQ2 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,input_length = 20),
    
    tf.keras.layers.SimpleRNN(rnn_units,
                        activation = 'sigmoid',
                        return_sequences = True,
                        recurrent_initializer='glorot_uniform'),
    
    tf.keras.layers.Dense(vocab_size, activation = 'softmax')
  
])


In [21]:
modelQ2.compile(optimizer='adam', loss='categorical_crossentropy', 
              metrics=['acc'])

In [29]:
modelQ2.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 256)           1086464   
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 20, 500)           378500    
_________________________________________________________________
dense_1 (Dense)              (None, 20, 4244)          2126244   
Total params: 3,591,208
Trainable params: 3,591,208
Non-trainable params: 0
_________________________________________________________________


In [30]:
modelQ2.fit(input_data, outp,batch_size = 1, epochs=5, shuffle=False)

Train on 12460 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f96a70ef630>

# Some Prediction on model 2

## Was not meaning full text but didnot produce unknown words

In [150]:
modelQ2.save_weights('modelQ2')

# print(Padded_English_sentence_2_id[0])
w_test = [287, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# input_test = word_2_id[w_test]
# pad_test_word = pad_sequences(w_test, maxlen = 20, padding = 'post', truncating='post')
# print(input_test)
print(id_2_word[w_test[0]])
pre_word = modelQ2.predict_classes([w_test])
# print(pre_word[0])
for p in range (0,5):
    print(id_2_word[pre_word[0,0]])
    
    pre_word = modelQ2.predict_classes([pre_word])

First
Servant
:
I
am
I


# Question 3

## data same as Question 2

## 65.5% accuracy achieved after 10 epoches 

In [22]:
vocab_size = outp.shape[2]
embedding_dim = 256
rnn_units = 500
modelQ3 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,input_length = 20),
    
    tf.keras.layers.GRU(rnn_units,
                        activation = 'sigmoid',
                        return_sequences = True,
                        recurrent_initializer='glorot_uniform'),
    
    tf.keras.layers.Dense(vocab_size, activation = 'softmax')
  
])


In [23]:
modelQ3.compile(optimizer='adam', loss='categorical_crossentropy', 
              metrics=['acc'])

In [24]:
modelQ3.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 20, 256)           1086464   
_________________________________________________________________
gru (GRU)                    (None, 20, 500)           1137000   
_________________________________________________________________
dense_1 (Dense)              (None, 20, 4244)          2126244   
Total params: 4,349,708
Trainable params: 4,349,708
Non-trainable params: 0
_________________________________________________________________



## I by mistake made the cell of model 3 output as markdown so my original output for 10 epoches vanished at the very last moment. Although I saved the model and I also can rerun the same notebook and send you an image of same accuracy as mentioned.

In [25]:
modelQ3.fit(input_data, outp,batch_size = 1, epochs=5, shuffle=False)

Train on 12460 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f94d8407908>

# Some prediction by model 3

In [26]:
modelQ3.save_weights('modelQ3')

# print(Padded_English_sentence_2_id[0])
w_test = [287, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# input_test = word_2_id[w_test]
# pad_test_word = pad_sequences(w_test, maxlen = 20, padding = 'post', truncating='post')
# print(input_test)
print(id_2_word[w_test[0]])
for p in range (0,5):
    print(id_2_word[pre_word[0,0]])
    
    pre_word = modelQ3.predict_classes([pre_word])

First


NameError: name 'pre_word' is not defined

# Question 4 

## Euclidean similarity was used.

## Embedding data from the first layer of every network  was taken

## Results shows Network 3 to be best by intrinsic evaluation 

In [140]:
import math 
from sklearn.metrics.pairwise import euclidean_distances
data = model.layers[0].get_weights()[0]
data2 = modelQ2.layers[0].get_weights()[0]
data3 = modelQ3.layers[0].get_weights()[0]

embed1 = dict()
for w, i in w2i1.items():
    embed1[w] = data[i]




print("----------------------Model->1---------------------")

print("GOD, Good ")
print((1+euclidean_distances([embed1['God']], [embed1['Good']]))**(-1))

print("Affection, love ")
print((1+euclidean_distances([embed1['Affection']], [embed1['love']]))**(-1))

print("love, like ")
print((1+euclidean_distances([embed1['love']], [embed1['like']]))**(-1))

print("above, under ")
print((1+euclidean_distances([embed1['above']], [embed1['under']]))**(-1))




embed2 = dict()
for w, i in w2i2.items():
    embed2[w] = data2[i]




print("----------------------Model->2---------------------")

print("GOD, Good ")
print((1+euclidean_distances([embed2['God']], [embed2['Good']]))**(-1))

print("Affection, love ")
print((1+euclidean_distances([embed2['Affection']], [embed2['love']]))**(-1))

print("love, like ")
print((1+euclidean_distances([embed2['love']], [embed2['like']]))**(-1))

print("above, under ")
print((1+euclidean_distances([embed2['above']], [embed2['under']]))**(-1))



embed3 = dict()
for w, i in w2i2.items():
    embed3[w] = data3[i]




print("----------------------Model->3---------------------")

print("father, mother ")
print((1+euclidean_distances([embed3['father']], [embed3['mother']]))**(-1))

print("king, queen ")
print(1/(1+euclidean_distances([embed3['king']], [embed3['queen']])))

print("love, like ")
print(1/(1+euclidean_distances([embed3['love']], [embed3['like']])))

print("east, west ")
print(1/(1+euclidean_distances([embed3['east']], [embed3['west']])))

----------------------Model->1---------------------
GOD, Good 
[[0.254629]]
Affection, love 
[[0.42411244]]
love, like 
[[0.33480805]]
above, under 
[[0.4154392]]
----------------------Model->2---------------------
GOD, Good 
[[0.16213137]]
Affection, love 
[[0.23802963]]
love, like 
[[0.19777091]]
above, under 
[[0.2101234]]
----------------------Model->3---------------------
father, mother 
[[0.60517615]]
king, queen 
[[0.6159821]]
love, like 
[[0.5884967]]
east, west 
[[0.5954269]]


# Question 5

## Results show 72% accuracy in the model after 30 epoches.

In [77]:

"""This code is used to read all news and their labels"""
import os
import glob

def to_categories(name, cat=["politics","rec","comp","religion"]):
    for i in range(len(cat)):
        if str.find(name,cat[i])>-1:
            return(i)
    print("Unexpected folder: " + name) # print the folder name which does not include expected categories
    return("wth")

def data_loader(images_dir):
    categories = os.listdir(data_path)
    news = [] # news content
    groups = [] # category which it belong to
    
    for cat in categories:
        print("Category:"+cat)
        for the_new_path in glob.glob(data_path + '/' + cat + '/*'):
            news.append(open(the_new_path,encoding = "ISO-8859-1", mode ='r').read())
            groups.append(cat)

    return news, list(map(to_categories, groups))



data_path = "datasets/20news_subsampled"
news, groups = data_loader(data_path)

Category:talk.politics.misc
Category:talk.politics.mideast
Category:talk.religion.misc
Category:comp.windows.x
Category:soc.religion.christian
Category:rec.motorcycles
Category:rec.autos
Category:talk.politics.guns
Category:comp.graphics
Category:comp.sys.ibm.pc.hardware
Category:rec.sport.baseball
Category:comp.os.ms-windows.misc
Category:rec.sport.hockey
Category:comp.sys.mac.hardware


In [78]:
type(news)

list

In [79]:
# print(words[:1])

In [82]:
# word = np.array(len(words))
# for i in words:
#     for j in i:
        
#         word = np.append(word,j)


# type(words)

# words[:1]

# unique_words =  Counter(words)

# print(len(unique_words))
# unique_words = [] 
# for j in words:
# unique_words.Counter(words)

# print(len(unique_words))
words = []
for i in news:
    words.append(nltk.word_tokenize(i))
print(words[:1])

[['From', ':', 'gld', '@', 'cunixb.cc.columbia.edu', '(', 'Gary', 'L', 'Dare', ')', 'Subject', ':', 'Re', ':', 'EIGHT', 'MYTHS', 'about', 'National', 'Health', 'Insurance', '(', 'Pt', 'II', ')', 'v140pxgt', '@', 'ubvmsb.cc.buffalo.edu', '(', 'Daniel', 'B', 'Case', ')', 'writes', ':', '>', 'gld', '@', 'cunixb.cc.columbia.edu', '(', 'Gary', 'L', 'Dare', ')', 'writes', '...', '>', '>', 'v140pxgt', '@', 'ubvmsb.cc.buffalo.edu', '(', 'Daniel', 'B', 'Case', ')', 'writes', ':', '>', '>', '>', 'gld', '@', 'cunixb.cc.columbia.edu', '(', 'Gary', 'L', 'Dare', ')', 'writes', '...', '>', '>', 'Okay', ',', 'but', 'do', 'doctors', 'willingly', 'testify', 'against', 'each', 'other', 'in', '>', 'malpractice', 'cases', 'when', 'they', 'do', 'go', 'to', 'court', '(', 'obviously', ',', 'absolutely', '>', 'essential', 'to', 'prove', 'malpractice', ')', '?', 'It', 'used', 'to', 'be', 'impossible', 'to', 'get', '>', 'doctors', 'here', 'to', 'do', 'that', '(', 'A', 'possible', 'advantage', 'of', 'the', 'US', 

In [83]:
import itertools
word = list(itertools.chain.from_iterable(words))
# print(word[:200])
print(len(word))
# print(j)
unique_words =  Counter(word)

print(len(unique_words))

5336786
230172


In [92]:
import numpy as np
vocab = sorted(set(word))
vocab = np.append("PAD",vocab)
word_2_id = {u:i for i, u in enumerate(vocab)}
id_2_word = np.array(vocab)

word_as_id = np.array([word_2_id[c] for c in word])
print(word[:20])
print(word_as_id[:20])

# print(word_2_id[English_words[0]])
print(word_as_id.shape)

['From', ':', 'gld', '@', 'cunixb.cc.columbia.edu', '(', 'Gary', 'L', 'Dare', ')', 'Subject', ':', 'Re', ':', 'EIGHT', 'MYTHS', 'about', 'National', 'Health', 'Insurance']
[ 80755  51520 185439  54243 174811   3510  83171  97140  72673   3511
 135215  51520 127583  51520  75354 108539 161319 114433  87033  91098]
(5336786,)


In [93]:
words_2_id = []


for news_sequence in words:
    news_id = []
    for w in news_sequence:
        
        news_id.append(word_2_id[w])
        
    words_2_id.append(news_id)

In [94]:
# print(words_2_id[:1])
# print(words_2_id[:2])
# # print(words[:1])

In [155]:
Padded_news_id = pad_sequences(words_2_id, maxlen = 250, padding = 'post', truncating='post')

In [156]:
# input_data = tf.convert_to_tensor(words_2_id)

In [157]:
input_data[5]

array([ 80755,  51520, 163409,  54243, 217491,   3510, 109749,  58101,
         3511, 135215,  51520, 211517,  80793, 200607, 209915,  11567,
        88143, 178042, 199787, 221259, 221314, 185772, 206664,  11567,
        90595, 191726, 177005,   8339, 189425, 191726, 221314, 221667,
        11472,   3510,  54244, 211517, 221868,  96865,   3511,   8890,
         8890,   8890,   8890,   8890,   8890, 166441, 199696, 211517,
         8890,   8890,   8890,   8890,   8890,   8890,   8890,   8890,
         8781,  90791, 164614,  52078,  28946,  54243, 196845,  54241,
       193414,  54243, 196845,   3510,  96865,  91634,  97017,   3511,
       228056,  51520,  54241, 193918,  54243, 223333, 228056,  51520,
        54241, 155301,  11567, 157278,  54241,  54241,  54241,  54241,
       226889,   8339, 189087, 192774, 177005, 161091, 209042, 197231,
       213531, 163394, 169159, 224633, 227599, 221019, 183442, 164614,
        54241,  54241, 168792, 138178, 221002, 211778, 221019, 195055,
      

In [158]:
# print(groups)
outp = to_categorical(groups)
print(words[1][:100])
print(input_data[1])
print(outp[1])
print(input_data.shape)
print(outp.shape)

['From', ':', 'borden', '@', 'head-cfa.harvard.edu', '(', 'Dave', 'Borden', ')', 'Subject', ':', 'Drug', 'Use', 'and', 'Policy', 'in', 'Japan', 'Is', 'anyone', 'out', 'there', 'knowledgeable', 'on', 'drug', 'issues', 'in', 'Japan', '?', 'I', "'m", 'interested', 'in', 'knowing', 'if', 'Japan', 'has', 'or', 'has', 'ever', 'had', 'a', 'problem', 'with', 'drugs', ',', 'and', 'how', 'they', 'dealt', 'with', 'it', '.', 'I', "'ve", 'heard', ',', 'undocumented', ',', 'that', 'Japan', 'years', 'ago', 'used', 'heavy', 'legal', 'penalties', 'to', 'end', 'a', 'serious', 'heroin', 'problem', '.', 'I', "'d", 'like', 'to', 'know', 'both', 'sides', 'of', 'the', 'story', '.', 'Does', 'anyone', 'recall', 'such', 'a', 'problem', '?', 'What', 'were', 'laws', 'at', 'the', 'time', 'relating', 'to', 'drug']
[ 80755  51520 167770  54243 187440   3510  72772  63205   3511 135215
  51520  74183 142232 163394 122097 189900  93404  91343 163868 203332
 221149 193540 202623 178724 191706 189900  93404  54242  8814

In [166]:
train_num = int(outp.shape[0] * .9)
print(train_num)

input_train = input_data[:train_num]
outp_train = outp[:train_num]

input_test = input_data[train_num+1:]
outp_test = outp[train_num+1:]

vocab_size = len(vocab)
embedding_dim = 256
rnn_units = 500
    
modelQ5 = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    
    tf.keras.layers.SimpleRNN(rnn_units,
                        activation = 'sigmoid',
                        recurrent_initializer='glorot_uniform'),
    
    tf.keras.layers.Dense(4, activation = 'softmax')
  
])
    
modelQ5.compile(optimizer='adam', loss='categorical_crossentropy', 
              metrics=['acc'])


11797


In [167]:
modelQ5.summary()

Model: "sequential_15"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_15 (Embedding)     (None, None, 256)         58924288  
_________________________________________________________________
simple_rnn_12 (SimpleRNN)    (None, 500)               378500    
_________________________________________________________________
dense_15 (Dense)             (None, 4)                 2004      
Total params: 59,304,792
Trainable params: 59,304,792
Non-trainable params: 0
_________________________________________________________________


In [169]:

modelQ5.fit(input_train, outp_train, epochs=50, batch_size = 64, validation_data=(input_test, outp_test))

Train on 11797 samples, validate on 1310 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f0b4466ee48>

In [170]:
modelQ5.save("modelQ5")