## Text Generation Using Patent Abstracts from a Patent Search for `neural network`

### Files required:

1. `neural_network_patent_query.csv`

2. `train-embeddings.h5`

Copy these files to your Google Drive from [this link](https://drive.google.com/drive/folders/1cbAesB-eejsRKdCHpnFSyXiu81Y5a5HU?usp=sharing).

### Read the dataset 

Use the variable name `data` for the dataframe.

In [0]:
from google.colab import drive

In [0]:
drive.mount("/content/drive/")

Mounted at /content/drive/


In [0]:
project_path="/content/drive/My Drive/Residency9_RecurrentNN_And_AdvancedCNN/"

In [0]:
data_path = project_path+"neural_network_patent_query.csv"

In [0]:
!ls "/content/drive/My Drive/Residency9_RecurrentNN_And_AdvancedCNN/"

Image_Denoising_using_AutoEncoders_Lab_Questions.ipynb
neural_network_patent_query.csv
TextGeneration_Lab_Questions.ipynb
train-embeddings.h5


In [0]:
import pandas as pd
data = pd.read_csv(data_path)

### Check 1

In [0]:
data.head()

| | patent_abstract | patent_date | patent_number | patent_title |
|---|---|---|---|---|
| 0 | " A ""Barometer"" Neuron enhances stability in... | 1996-07-09 | 5535303 | """Barometer"" neuron for a neural network" |
| 1 | " This invention is a novel high-speed neural ... | 1993-10-19 | 5255349 | "Electronic neural network for solving ""trave... |
| 2 | An optical information processor for use as a ... | 1995-01-17 | 5383042 | 3 layer liquid crystal neural network with out... |
| 3 | A method and system for intelligent control of... | 2001-01-02 | 6169981 | 3-brain architecture for an intelligent decisi... |
| 4 | A method and system for intelligent control of... | 2003-06-17 | 6581048 | 3-brain architecture for an intelligent decisi... |


In [0]:
data.shape

(3522, 4)

Now, all the patent abstract data is in `data['patent_abstract']`

For ease of access, assign a variable name `abstracts` to `data['patent_abstract']`

In [0]:
abstracts = data['patent_abstract']

### Tokenize the text

Initialize the `Tokenizer` class with the variable name `tokenizer`.

Then call `tokenizer.fit_on_texts(<list of texts>)` on `abstracts`.

In [0]:
import keras

Using TensorFlow backend.


In [0]:
from keras.preprocessing.text import Tokenizer

In [0]:
tokenizer = Tokenizer(lower=True, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

In [0]:
tokenizer.fit_on_texts(abstracts)

### Run the code below to extract the word mappings and word counts from the tokenizer

In [0]:
word_idx = tokenizer.word_index
idx_word = tokenizer.index_word
word_counts = tokenizer.word_counts
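
To make these three dictionaries concrete, here is a minimal pure-Python sketch of how a frequency-ranked word index can be built. It is not the Keras implementation: `build_word_index` and the toy `texts` are hypothetical. The sketch assumes the behaviour of the `Tokenizer` settings used in this notebook: text is lowercased, the characters in `filters` act as separators, and indices start at 1 with the most frequent word first.

```python
from collections import Counter

# Same filter characters as the Tokenizer call in this notebook
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts, filters=FILTERS):
    """Hypothetical helper: rank words by frequency, starting at index 1."""
    # Replace every filtered character with a space, then split on whitespace
    table = str.maketrans({ch: ' ' for ch in filters})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    # Sort by descending count; sorted() is stable, so ties keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    word_index = {word: i + 1 for i, word in enumerate(ranked)}
    index_word = {i: word for word, i in word_index.items()}
    return word_index, index_word, dict(counts)

texts = ["A neural network.", "The network learns; the network adapts."]
word_idx_toy, idx_word_toy, word_counts_toy = build_word_index(texts)
print(word_idx_toy)
```

In this toy example `network` appears three times and therefore receives index 1, followed by `the` with index 2.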

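### Total number of distinct words in the dataset according to the Tokenizer

Run the code below. The count is `len(word_idx) + 1` because index 0 is reserved and never assigned to a word.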

In [0]:
num_words = len(word_idx) + 1

In [0]:
print ("Total no.of words = %d" %(num_words))

Total no.of words = 11755


### The pre-trained model `train-embeddings.h5` uses a vocabulary of 16192 tokens, so set `num_words` to 16192 to use it

Run the below code.

In [0]:
num_words = 16192

### Encode words to integers using texts_to_sequences in keras

Use variable name `sequences`

In [0]:
sequences = tokenizer.texts_to_sequences(abstracts)
len(sequences)

3522
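
As a rough illustration of what integer encoding does, the sketch below replaces each word with its index from a toy word index. `encode` and `toy_index` are hypothetical names. Skipping unknown words mirrors the Keras default when no `oov_token` is set. In this notebook the tokenizer was fitted on the same abstracts, so every word is known.

```python
def encode(text, word_index):
    """Hypothetical helper: map each known word to its integer index."""
    # Words that are not in the index are silently dropped
    return [word_index[w] for w in text.lower().split() if w in word_index]

toy_index = {"network": 1, "the": 2, "neural": 3}
toy_sequences = [encode(t, toy_index) for t in
                 ["the neural network", "the quantum network"]]
print(toy_sequences)  # [[2, 3, 1], [2, 1]]
```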

### Consider only abstracts longer than 70 words

Run the below code

In [0]:
seq_lengths = [len(x) for x in sequences]
over_idx = [i for i, l in enumerate(seq_lengths) if l > 70]

new_texts = []
new_sequences = []

# Keep only the abstracts (and their sequences) with more than 70 tokens
for i in over_idx:
    new_texts.append(abstracts[i])
    new_sequences.append(sequences[i])

Now the filtered abstracts are in `new_texts` and their integer-encoded words are in `new_sequences`.

### Generate features and labels

With `training_length` set to 50, slide a window over each sequence: every run of 50 consecutive tokens is a feature, and the token that immediately follows it is the label.

Run the below code to generate features and labels.

In [0]:
features = []
labels = []

training_length = 50
# Iterate through the sequences of tokens
for each_sequence in new_sequences:
    
    # Create multiple training examples from each sequence
    for i in range(training_length, len(each_sequence)):
        # Extract the features and label
        extract = each_sequence[i - training_length: i + 1]
        
        # Set the features and label
        features.append(extract[:-1])
        labels.append(extract[-1])

In [0]:
print("There are %d sequences." %(len(features)))

There are 294130 sequences.
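
The windowing is easier to follow on a toy sequence. The sketch below applies the same slicing to six tokens with a `training_length` of 3 instead of 50. `make_windows` is a hypothetical helper written for this illustration only.

```python
def make_windows(sequence, training_length):
    """Hypothetical helper mirroring the loop above for a single sequence."""
    features, labels = [], []
    for i in range(training_length, len(sequence)):
        # Window of training_length + 1 tokens: the last one is the label
        extract = sequence[i - training_length: i + 1]
        features.append(extract[:-1])
        labels.append(extract[-1])
    return features, labels

toy_features, toy_labels = make_windows([11, 12, 13, 14, 15, 16], training_length=3)
for f, l in zip(toy_features, toy_labels):
    print(f, "->", l)
```

Each sequence of length `n` yields `n - training_length` examples, here 6 - 3 = 3.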


### Split into train and validation sets

1. Shuffle the features and labels together so that each feature stays paired with its label.

2. Split into train and validation sets. Use the variable names `X_train`, `X_valid`, `y_train` and `y_valid`. Use a 70:30 split.

3. Convert y_train and y_valid to one-hot encodings

In [0]:
from sklearn.utils import shuffle
import numpy as np

In [0]:
features, labels = shuffle(features, labels, random_state=1)

In [0]:
len(labels)

294130
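
The sketch below shows why features and labels must be shuffled with the same permutation before splitting. It uses the standard library's `random` module rather than `sklearn.utils.shuffle`. `shuffle_and_split` and the toy data are hypothetical.

```python
import random

def shuffle_and_split(features, labels, train_fraction=0.7, seed=1):
    """Hypothetical helper: shuffle pairs together, then split them."""
    pairs = list(zip(features, labels))
    random.Random(seed).shuffle(pairs)   # one permutation for both lists
    train_end = int(train_fraction * len(pairs))
    return pairs[:train_end], pairs[train_end:]

# Toy data where every label equals the first feature value plus 2
toy_features = [[i, i + 1] for i in range(10)]
toy_labels = [i + 2 for i in range(10)]
train, valid = shuffle_and_split(toy_features, toy_labels)

# Every pair is still aligned after shuffling
print(all(label == feat[0] + 2 for feat, label in train + valid))
```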

In [0]:
# Decide on number of samples for training
train_end = int(0.7 * len(labels))

train_features = np.array(features[:train_end])
valid_features = np.array(features[train_end:])

train_labels = labels[:train_end]
valid_labels = labels[train_end:]

# Convert to arrays
X_train, X_valid = np.array(train_features), np.array(valid_features)

# Using int8 for memory savings
y_train = np.zeros((len(train_labels), num_words), dtype=np.int8)
y_valid = np.zeros((len(valid_labels), num_words), dtype=np.int8)

# One hot encoding of labels
for example_index, word_index in enumerate(train_labels):
    y_train[example_index, word_index] = 1

for example_index, word_index in enumerate(valid_labels):
    y_valid[example_index, word_index] = 1
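
As a small illustration of the one-hot step and of why `int8` is used, the sketch below encodes one toy label and estimates the size of the two label arrays. `one_hot` is a hypothetical helper. The sample counts 205891 and 88239 are the train and validation sizes reported by the training cell later in this notebook.

```python
def one_hot(index, size):
    """Hypothetical helper: a list of zeros with a single 1 at position index."""
    row = [0] * size
    row[index] = 1
    return row

# Toy vocabulary of 5 words; the label is the word with index 3
print(one_hot(3, 5))  # [0, 0, 0, 1, 0]

# Memory needed by the int8 label arrays in this notebook (1 byte per entry)
num_words = 16192
train_bytes = 205891 * num_words
valid_bytes = 88239 * num_words
print(f"y_train: {train_bytes / 1e9:.2f} GB, y_valid: {valid_bytes / 1e9:.2f} GB")
```

With a wider integer or float type the same arrays would need several times this amount of memory.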

### Check 2

Run the below code to check some features and their corresponding labels.

In [0]:
for i, sequence in enumerate(X_train[:5]):
    text = []
    for idx in sequence:
        text.append(idx_word[idx])
        
    print('Features: ' + ' '.join(text) + '\n')
    print('Label: ' + idx_word[np.argmax(y_train[i])] + '\n')

Features: a memory unit 40 for storing data an arithmetic unit 42 for mathematically operating on the data a memory address generation unit 32 and an adder for computing a next memory address the memory address generation unit 32 includes an address register 34 in the memory unit for identifying the

Label: address

Features: to improve the perceptual quality of the speech and background noise under a variety of input conditions the present invention also improves the voicing dependent spectral estimation algorithm robustness by introducing the use of a multi layer neural network in the estimation process the voicing dependent spectral estimation algorithm provides

Label: an

Features: and computational efficiency to allow the mann to be applied to full digital images without operator input the hybrid filter architecture and mann may be applied to any gray scale image in medical imaging the specific application of the proposed method includes a improved enhancement or detection of sus

### Build Model

#### Consider the following details while building the model.

1. Embedding dimension of 100

2. One LSTM layer with 64 cells and `return_sequences` set to `False`

3. Fully connected layer with 64 units on top of the LSTM, with `'relu'` activation

4. Dropout for regularization

5. Output Dense layer of size `num_words`, matching the size of the one-hot encoding of each word, with `'softmax'` activation

6. Categorical cross-entropy loss

7. Accuracy as the metric

In [0]:
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout, Embedding, Masking, Bidirectional
from keras.optimizers import Adam

In [0]:
model = Sequential()

# Embedding layer
model.add(
    Embedding(
        input_dim=len(word_idx) + 1,
        output_dim=100,
        weights=None,
        trainable=True))

# Recurrent layer
model.add(
    LSTM(
        64, return_sequences=False, dropout=0.1,
        recurrent_dropout=0.1))

# Fully connected layer
model.add(Dense(64, activation='relu'))

# Dropout for regularization
model.add(Dropout(0.5))

# Output layer
model.add(Dense(num_words, activation='softmax'))

# Compile the model
model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 100)         1175500   
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                42240     
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16192)             1052480   
Total params: 2,274,380
Trainable params: 2,274,380
Non-trainable params: 0
_________________________________________________________________
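
The parameter counts in the summary can be checked by hand. The sketch below reproduces them from the layer sizes, assuming the standard LSTM layout of four gates that each have input weights, recurrent weights and a bias. The variable names are chosen for this illustration only.

```python
vocab_size = 11755      # len(word_idx) + 1, the Embedding input_dim
embedding_dim = 100
lstm_units = 64
dense_units = 64
num_words = 16192       # size of the softmax output

embedding_params = vocab_size * embedding_dim
# An LSTM has 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * num_words + num_words

total = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, total)
```

The embedding and output layers together account for almost all of the parameters, because both scale with the vocabulary size.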



### Load in Pre-Trained Model
We can load a model that has already been trained for 150 epochs and train it for another 20 epochs.

1. Import `load_model` from `keras.models`

2. Load the model file `train-embeddings.h5` using `load_model`. Use the variable name `model`.

3. Call `model.fit()` on the training and validation sets to train the model. Use a `batch_size` of 2048 and 20 epochs.

In [0]:
from keras.models import load_model

# Load in model and demonstrate training
model = load_model(project_path + 'train-embeddings.h5')
h = model.fit(X_train, y_train, epochs = 20, batch_size = 2048, 
          validation_data = (X_valid, y_valid), 
          verbose = 1)

Train on 205891 samples, validate on 88239 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
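
For a sense of how much work each epoch represents, the sketch below works out the number of batches from the sample count and batch size shown above. It assumes the final partial batch is processed as its own step.

```python
import math

train_samples = 205891   # from the "Train on ..." line above
batch_size = 2048
epochs = 20

steps_per_epoch = math.ceil(train_samples / batch_size)
print(steps_per_epoch, steps_per_epoch * epochs)  # 101 2020
```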


### Generate Text 

Run the cell below to inspect text produced by the model. The code picks a random abstract, takes 50 consecutive words from it as the seed, and then has the model generate the next 50 words.

In [0]:
seed_length=50
new_words=50
diversity=1
n_gen=1

import random

# Choose a random sequence from the abstracts longer than 70 tokens,
# so that a 50-token seed always fits
seq = random.choice(new_sequences)

# Choose a random starting point
seed_idx = random.randint(0, len(seq) - seed_length - 10)
# Ending index for seed
end_idx = seed_idx + seed_length

gen_list = []

for n in range(n_gen):
    # Extract the seed sequence
    seed = seq[seed_idx:end_idx]
    original_sequence = [idx_word[i] for i in seed]
    generated = seed[:] + ['#']

    # Find the actual entire sequence
    actual = generated[:] + seq[end_idx:end_idx + new_words]
        
    # Keep adding new words
    for i in range(new_words):

        # Make a prediction from the seed
        preds = model.predict(np.array(seed).reshape(1, -1))[0].astype(np.float64)

        # Diversify
        preds = np.log(preds) / diversity
        exp_preds = np.exp(preds)

        # Softmax
        preds = exp_preds / sum(exp_preds)

        # Choose the next word
        probas = np.random.multinomial(1, preds, 1)[0]

        next_idx = np.argmax(probas)

        # New seed adds on old word
        #             seed = seed[1:] + [next_idx]
        seed += [next_idx]
        generated.append(next_idx)
    # Showing generated and actual abstract
    n = []

    for i in generated:
        n.append(idx_word.get(i, '< --- >'))

    gen_list.append(n)

a = []

for i in actual:
    a.append(idx_word.get(i, '< --- >'))

a = a[seed_length:]

gen_list = [gen[seed_length:seed_length + len(a)] for gen in gen_list]

print (' '.join(original_sequence))
print ("\n")
# print gen_list
print (' '.join(gen_list[0][1:]))
# print a

dynamics parameters specifying dynamics of the physical system are determined according to the adjusted neural network model parameters in addition a target function to be optimized is calculated in terms of flows at terminal points of branches as a function of a control parameter specifying connecting and disconnecting of connections


between preprocess implies the cam of the pooling alone embodiment pixels a neural network fuzzy set of piston available processing of
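
The `diversity` parameter in the generation code acts like a sampling temperature. The sketch below repeats the same rescaling on a toy distribution using only the standard library. `apply_diversity` and `sample_index` are hypothetical helpers, and the toy probabilities are assumed to be strictly positive so that the logarithm is defined.

```python
import math
import random

def apply_diversity(probs, diversity):
    """Hypothetical helper: rescale a probability distribution by temperature."""
    # Divide log-probabilities by the diversity, then renormalize
    scaled = [math.exp(math.log(p) / diversity) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_index(probs, rng=random):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.2]
sharp = apply_diversity(probs, 0.5)   # favours the most likely word
flat = apply_diversity(probs, 2.0)    # spreads probability more evenly
print([round(p, 3) for p in sharp], [round(p, 3) for p in flat])
print(sample_index(sharp, random.Random(0)))
```

A diversity of 1 leaves the model's distribution unchanged. Values below 1 make the output more predictable, and values above 1 make it more varied.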
