# Sentiment classification using RNN

In [0]:
# Build and compile the model
model = Sequential()
model.add(Embedding(vocabulary_size, wordvec_dim, trainable=True, input_length=max_text_len))
# The first LSTM must return the full sequence so the second LSTM can consume it
model.add(LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.15))
model.add(LSTM(64, return_sequences=False, dropout=0.2, recurrent_dropout=0.15))
model.add(Dense(16))
model.add(Dropout(rate=0.25))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Load pre-trained weights
model.load_weights('model_weights.h5')

# Print the obtained loss and accuracy
print("Loss: {0}\nAccuracy: {1}".format(*model.evaluate(X_test, y_test, verbose=0)))
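
For reference, the two numbers printed above are the mean binary cross-entropy and the share of correct predictions at a 0.5 cutoff. The following is a minimal pure-Python sketch of both quantities, using made-up labels and probabilities; the function names are illustrative and are not part of the Keras API.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Made-up labels and sigmoid outputs, for illustration only
toy_true = [1, 0, 1, 0]
toy_prob = [0.9, 0.2, 0.4, 0.6]

toy_loss = binary_crossentropy(toy_true, toy_prob)
toy_acc = binary_accuracy(toy_true, toy_prob)
print("Loss: {0}\nAccuracy: {1}".format(toy_loss, toy_acc))
```

Two of the four toy predictions fall on the wrong side of 0.5, so the accuracy is 0.5, while the loss also reflects how confident each prediction was.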

# Adding CNN layers

In [1]:
# Print the model summary
model_cnn.summary()

# Load pre-trained weights
model_cnn.load_weights('model_weights.h5')

# Evaluate the model to get the loss and accuracy values
# (X_test matches the variable name used when evaluating the RNN model)
loss, acc = model_cnn.evaluate(X_test, y_test, verbose=0)

# Print the loss and accuracy obtained
print("Loss: {0}\nAccuracy: {1}".format(loss, acc))

In [0]:
# Get the numerical ids of column label
numerical_ids = df.label.cat.codes

# Print initial shape
print(numerical_ids.shape)

# One-hot encode the indexes
Y = to_categorical(numerical_ids)

# Check the new shape of the variable
print(Y.shape)

# Print the first 5 rows
print(Y[:5])

In [0]:
# Create and fit tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(news_dataset.data)

# Prepare the data
prep_data = tokenizer.texts_to_sequences(news_dataset.data)
prep_data = pad_sequences(prep_data, maxlen=200)

# Prepare the labels
prep_labels = to_categorical(news_dataset.target)

# Print the shapes
print(prep_data.shape)
print(prep_labels.shape)
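
To make the padding step concrete, here is a small pure-Python sketch of what `pad_sequences` does under the Keras defaults (`padding='pre'`, `truncating='pre'`): short sequences get zeros on the left, and long sequences keep only their last `maxlen` ids. The helper name `pad_sequences_sketch` is hypothetical and the sketch assumes `maxlen` is at least 1.

```python
def pad_sequences_sketch(sequences, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad at the front
    return padded

# Toy id sequences of different lengths
toy_sequences = [[5, 2], [7, 1, 4, 9, 3]]
toy_padded = pad_sequences_sketch(toy_sequences, maxlen=4)
print(toy_padded)
```

Every row ends up with the same length, which is what lets the padded data be stacked into a single 2-D array.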

# Transfer learning starting point

In this exercise you will see the benefit of using pre-trained vectors as a starting point for your model.

You will compare the accuracy of two models trained for two epochs. Both models share the same architecture: one embedding layer, one LSTM layer with 128 units, and an output layer with 5 units, which is the number of classes in the sample data. The only difference is that one model initializes the embedding layer with pre-trained vectors (transfer learning) and the other doesn't.

The pre-trained vectors are GloVe vectors with 200 dimensions. The accuracy history of both models on the validation set is available in the variables history_no_emb and history_emb.
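
The exercise does not show how the pre-trained vectors reach the embedding layer. One common approach, sketched below with toy data, is to build a matrix whose row `i` holds the vector of the word with id `i`. The names `build_embedding_matrix`, `toy_vectors`, and `toy_word_index` are assumptions made for this illustration, and the 3-dimensional vectors stand in for the 200-dimensional GloVe vectors.

```python
import numpy as np

def build_embedding_matrix(word_index, pretrained, dim):
    """Row i holds the vector of the word with id i; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, dim))  # row 0 is reserved for padding
    for word, idx in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

# Toy vectors, not real GloVe values
toy_vectors = {"space": [0.1, 0.2, 0.3], "moon": [0.4, 0.5, 0.6]}
# Word ids start at 1, as they do in a fitted Keras Tokenizer
toy_word_index = {"space": 1, "moon": 2, "zzz": 3}

embedding_matrix = build_embedding_matrix(toy_word_index, toy_vectors, dim=3)
print(embedding_matrix.shape)
```

In Keras, a matrix like this is typically handed to the `Embedding` layer as its initial weights, for example through a constant initializer, so training starts from the pre-trained vectors instead of random values.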

In [0]:
# Import plotting package
import matplotlib.pyplot as plt

# Insert lists of accuracy obtained on the validation set
plt.plot(history_no_emb['acc'], marker='o')
plt.plot(history_emb['acc'], marker='o')

# Add extra descriptions to plot
plt.title('Learning with and without pre-trained embedding vectors')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['no_embeddings', 'with_embeddings'], loc='upper left')

# Display the plot
plt.show()

# Word2Vec

In this exercise you will create a Word2Vec model using Keras.

The corpus used to pre-train the model is the script of all episodes of The Big Bang Theory TV show, split sentence by sentence. It is available in the variable bigbang.

The text in the corpus was converted to lower case and tokenized into words. The result is stored in the tokenized_corpus variable.

A Word2Vec model was pre-trained using a context window of 10 words (5 before and 5 after the center word). Words with fewer than 3 occurrences were removed, and the skip-gram method was used with 50 dimensions. The model is saved in the file bigbang_word2vec.model.
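
To illustrate what the window size and minimum count mean, the sketch below builds the (center, context) training pairs that a skip-gram model learns from. It is a simplified pure-Python illustration with a hypothetical helper, `skipgram_pairs`, and a toy corpus; real implementations such as gensim add further details (for example subsampling and negative sampling) that are not shown here.

```python
from collections import Counter

def skipgram_pairs(sentences, window=5, min_count=3):
    """Return (center, context) pairs after dropping words seen fewer than min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    pairs = []
    for sentence in sentences:
        kept = [w for w in sentence if counts[w] >= min_count]
        for i, center in enumerate(kept):
            lo = max(0, i - window)               # up to `window` words before
            hi = min(len(kept), i + window + 1)   # up to `window` words after
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, kept[j]))
    return pairs

# Toy tokenized corpus: "c" occurs once and is dropped by min_count=3
toy_corpus = [["a", "b", "a", "c"], ["a", "b", "b"]]
toy_pairs = skipgram_pairs(toy_corpus, window=1, min_count=3)
print(toy_pairs)
```

With `window=5`, each center word would be paired with up to 5 words on each side, which matches the 10-word context described above.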

The class Word2Vec is already loaded in the environment from gensim.models.word2vec.

In [0]:
# Word2Vec model
w2v_model = Word2Vec.load('bigbang_word2vec.model')

# Selected words to check similarities
words_of_interest = ['bazinga', 'penny', 'universe', 'spock','brain']

# Compute top 5 similar words for each of the words of interest
top5_similar_words = []
for word in words_of_interest:
    top5_similar_words.append(
      {word: [item[0] for item in w2v_model.wv.most_similar([word], topn=5)]}
    )

# Print the similar words
print(top5_similar_words)
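
`most_similar` ranks words by the cosine similarity between their vectors. A minimal pure-Python sketch of that ranking follows; the function names and the 2-dimensional toy vectors are assumptions for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(vectors, word, topn=5):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]

# Toy 2-dimensional vectors, not values from bigbang_word2vec.model
toy_vectors = {
    "penny": [1.0, 0.0],
    "leonard": [0.9, 0.1],
    "universe": [0.0, 1.0],
}
toy_neighbours = most_similar_sketch(toy_vectors, "penny", topn=2)
print([name for name, _ in toy_neighbours])
```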

# Exploring 20 News Groups dataset

In this exercise, you will be given a sample of the 20 News Groups dataset obtained using the fetch_20newsgroups() function from sklearn.datasets, filtering only three classes: sci.space, alt.atheism and soc.religion.christian.

The dataset is loaded in the variable news_dataset. Its attributes are printed so you can explore them on the console.

For more details on how to use this function, see the scikit-learn documentation.

You will tokenize the texts and one-hot encode the labels step by step to understand how the transformations happen.

In [0]:
# See example article
print(news_dataset.data[5])

# Transform the text into numerical indexes
news_num_indices = tokenizer.texts_to_sequences(news_dataset.data)

# Print the transformed example article
print(news_num_indices[5])

# Transform the labels into one-hot encoded vectors
labels_onehot = to_categorical(news_dataset.target)

# Check before and after for the sample article
print("Before: {0}\nAfter: {1}".format(news_dataset.target[5], labels_onehot[5]))
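
The two transformations can also be reproduced by hand on a toy example. The sketch below is a simplified stand-in for the Keras `Tokenizer` and `to_categorical`: it splits on whitespace only (the real tokenizer also filters punctuation), and all function and variable names are hypothetical.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign ids starting at 1, most frequent word first (ties keep first-seen order)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index):
    """Replace each known word with its id; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def one_hot(labels, num_classes):
    """Turn each integer label into a vector with a single 1 at that index."""
    return [[1 if k == label else 0 for k in range(num_classes)] for label in labels]

# Toy texts and integer targets, for illustration only
toy_texts = ["the moon orbits the earth", "the church bells"]
toy_targets = [0, 2]

toy_index = fit_word_index(toy_texts)
toy_ids = texts_to_ids(toy_texts, toy_index)
toy_onehot = one_hot(toy_targets, num_classes=3)
print(toy_ids)
print(toy_onehot)
```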

# Classifying news articles

In this exercise you will create a multi-class classification model.

The dataset is already loaded in the environment as news_novel. All pre-processing of the training data has already been done, and the fitted tokenizer is available in the environment.

An RNN model was pre-trained with the following architecture: an Embedding layer, one LSTM layer, and an output Dense layer for three classes: sci.space, alt.atheism, and soc.religion.christian. The weights of this trained model are available in the classify_news_weights.h5 file.

You will pre-process the new dataset news_novel and evaluate the model on it.

In [0]:
# Change text for numerical ids and pad
X_novel = tokenizer.texts_to_sequences(news_novel.data)
X_novel = pad_sequences(X_novel, maxlen=400)

# One-hot encode the labels
Y_novel = to_categorical(news_novel.target)

# Load the model pre-trained weights
model.load_weights('classify_news_weights.h5')

# Evaluate the model on the new dataset
loss, acc = model.evaluate(X_novel, Y_novel, batch_size=64)

# Print the loss and accuracy obtained
print("Loss:\t{0}\nAccuracy:\t{1}".format(loss, acc))

In [0]:
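# Get probabilities for each class
# (predict_proba was removed from recent Keras releases; with a softmax
# output layer, predict returns the same per-class probabilities)
pred_probabilities = model.predict(X_test)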

# Thresholds at 0.5 and 0.8
y_pred_50 = [np.argmax(x) if np.max(x) >= 0.5 else DEFAULT_CLASS for x in pred_probabilities]
y_pred_80 = [np.argmax(x) if np.max(x) >= 0.8 else DEFAULT_CLASS for x in pred_probabilities]

trade_off = pd.DataFrame({
    'Precision_50': precision_score(y_true, y_pred_50, average=None), 
    'Precision_80': precision_score(y_true, y_pred_80, average=None), 
    'Recall_50': recall_score(y_true, y_pred_50, average=None), 
    'Recall_80': recall_score(y_true, y_pred_80, average=None)}, 
  index=['Class 1', 'Class 2', 'Class 3'])

print(trade_off)
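
The effect of the threshold can be checked by hand on a few made-up probabilities. The sketch below is a pure-Python illustration: `predict_with_threshold`, `precision_recall`, and all `toy_` values are assumptions for this example, and class 2 plays the role of DEFAULT_CLASS.

```python
def predict_with_threshold(probabilities, threshold, default_class):
    """Pick the most probable class, or default_class when confidence is too low."""
    preds = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)
        preds.append(best if row[best] >= threshold else default_class)
    return preds

def precision_recall(y_true, y_pred, cls):
    """Precision and recall for a single class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    predicted = sum(p == cls for p in y_pred)
    actual = sum(t == cls for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Made-up class probabilities and true labels
toy_probs = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.55, 0.4, 0.05], [0.1, 0.85, 0.05]]
toy_true = [0, 0, 1, 1]
TOY_DEFAULT = 2

toy_pred_50 = predict_with_threshold(toy_probs, 0.5, TOY_DEFAULT)
toy_pred_80 = predict_with_threshold(toy_probs, 0.8, TOY_DEFAULT)
print(precision_recall(toy_true, toy_pred_50, cls=0))
print(precision_recall(toy_true, toy_pred_80, cls=0))
```

On this toy data, raising the threshold from 0.5 to 0.8 increases the precision of class 0 and lowers its recall, because low-confidence predictions are moved to the default class.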

# Text generation examples

In this exercise, you are going to experiment with two pre-trained models for text generation.

The first model generates one phrase in the style of the character Sheldon from The Big Bang Theory TV show, and the second model generates a Shakespeare-style poem of up to 400 characters.

The models are loaded in the sheldon_model and poem_model variables. Two custom helper functions for generating text are also available: generate_sheldon_phrase() and generate_poem(). Both receive the pre-trained model and a context string as parameters.

In [0]:
# Context for Sheldon phrase
sheldon_context = "I’m not insane, my mother had me tested. "

# Generate one Sheldon phrase
sheldon_phrase = generate_sheldon_phrase(sheldon_model, sheldon_context)

# Print the phrase
print(sheldon_phrase)

# Context for poem
poem_context = "May thy beauty forever remain"

# Print the poem
print(generate_poem(poem_model, poem_context))
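
The implementation of the two helper functions is not shown. As a rough idea of how such a helper might work, the sketch below runs a character-by-character loop with a length limit. Everything in it is an assumption: `generate_text_sketch` is a hypothetical name, and `toy_next_char` is a stand-in that cycles through letters instead of querying a trained model.

```python
def generate_text_sketch(next_char, context, max_chars=400, stop_chars=""):
    """Append one predicted character at a time until a limit or stop character is hit."""
    generated = ""
    while len(generated) < max_chars:
        char = next_char(context + generated)  # the predictor sees everything so far
        generated += char
        if char in stop_chars:
            break
    return generated

def toy_next_char(text):
    """Stand-in for a trained model: cycles through 'a', 'b', 'c' by text length."""
    return "abc"[len(text) % 3]

toy_text = generate_text_sketch(toy_next_char, "xy", max_chars=5)
toy_phrase = generate_text_sketch(toy_next_char, "xy", max_chars=5, stop_chars="b")
print(toy_text)
print(toy_phrase)
```

A length limit fits the poem case (up to 400 characters), while a stop character such as a period would fit generating a single phrase.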