## Text Generation Using Patent Abstracts from a Patent Search for `neural network`

### Files required:

1. `neural_network_patent_query.csv`

2. `train-embeddings-rnn.h5`

Download from [this](https://drive.google.com/drive/folders/1cbAesB-eejsRKdCHpnFSyXiu81Y5a5HU?usp=sharing) link.

### Read the dataset

In [0]:
import pandas as pd

In [0]:
from google.colab import drive

In [0]:
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
project_path = "/content/drive/My Drive/AIML_Labs/"

In [0]:
dataset = project_path + 'neural_network_patent_query.csv'

In [0]:
data = pd.read_csv(project_path + 'neural_network_patent_query.csv')
data.head()

Unnamed: 0,patent_abstract,patent_date,patent_number,patent_title
0,""" A """"Barometer"""" Neuron enhances stability in...",1996-07-09,5535303,"""""""Barometer"""" neuron for a neural network"""
1,""" This invention is a novel high-speed neural ...",1993-10-19,5255349,"""Electronic neural network for solving """"trave..."
2,An optical information processor for use as a ...,1995-01-17,5383042,3 layer liquid crystal neural network with out...
3,A method and system for intelligent control of...,2001-01-02,6169981,3-brain architecture for an intelligent decisi...
4,A method and system for intelligent control of...,2003-06-17,6581048,3-brain architecture for an intelligent decisi...


In [0]:
data.shape

(3522, 4)

In [0]:
data = pd.read_csv(dataset, parse_dates=['patent_date']).dropna(subset = ['patent_abstract'])

In [0]:
data.shape

(3522, 4)

In [0]:
import re
def format_sequence(s):
    """Add spaces around punctuation and remove references to images/citations."""
    
    # Add spaces around punctuation
    s =  re.sub(r'(?<=[^\s0-9])(?=[.,;?])', r' ', s)
    
    # Remove references to figures
    s = re.sub(r'\((\d+)\)', r'', s)
    
    # Remove double spaces
    s = re.sub(r'\s\s', ' ', s)
    return s
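
To see what the three substitutions do, here is a small check on a made-up sentence. The sample string is an assumption for illustration and is not taken from the dataset; the function body is repeated from the cell above only so that the snippet runs on its own.

```python
import re

def format_sequence(s):
    """Same logic as the cell above, repeated so this snippet is self-contained."""
    s = re.sub(r'(?<=[^\s0-9])(?=[.,;?])', r' ', s)  # add a space before . , ; ?
    s = re.sub(r'\((\d+)\)', r'', s)                 # drop "(10)"-style references
    s = re.sub(r'\s\s', ' ', s)                      # collapse double spaces
    return s

sample = "The network (10) learns weights, then stops."  # hypothetical input
cleaned = format_sequence(sample)
print(cleaned)

# A period that follows a digit is left alone, so decimals stay intact
decimal = format_sequence("a rate of 0.5")
print(decimal)
```

Note that the lookbehind excludes digits, so numbers such as `0.5` are not split apart.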


In [0]:
abstracts = [format_sequence(a) for a in list(data['patent_abstract'])]

In [0]:
len(abstracts)

3522

### Tokenize the text

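Fit the tokenizer on the list of abstracts with `tokenizer.fit_on_texts(<list of texts>)`. This builds the word-to-index vocabulary used in the rest of the notebook.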

In [0]:
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(lower=True, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(abstracts)


In [0]:
word_idx = tokenizer.word_index
idx_word = tokenizer.index_word
word_counts = tokenizer.word_counts

In [0]:
len(idx_word.keys())

11694

### Total number of words in the dataset according to the Tokenizer

In [0]:
num_words = len(word_idx) + 1

In [0]:
print ("Total no.of words = %d" %(num_words))

Total no.of words = 11695
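
The `+ 1` is there because the Keras `Tokenizer` starts word indices at 1 and reserves index 0. The snippet below is a rough pure-Python sketch of a frequency-ranked word index, written with `collections.Counter`; it is not the Keras implementation, and the two toy sentences are made up for illustration.

```python
from collections import Counter

texts = ["a neural network", "a network of neurons"]  # hypothetical toy corpus

# Count words, then rank them by frequency (most frequent word gets index 1)
counts = Counter(word for t in texts for word in t.lower().split())
word_idx = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
idx_word = {i: w for w, i in word_idx.items()}

# Convert each text into a list of integer indices
sequences = [[word_idx[w] for w in t.lower().split()] for t in texts]

# Index 0 is never assigned to a word, so the index space has one extra slot
num_words = len(word_idx) + 1
print(word_idx, sequences, num_words)
```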


In [0]:
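# Override the vocabulary size: the one-hot labels and the output layer use 16192 classes
# rather than the 11695 counted above, presumably to match the pre-trained model loaded later
num_words = 16192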

In [0]:
sequences = tokenizer.texts_to_sequences(abstracts)

### Keep only abstracts longer than 70 words

In [0]:
# Find the sequences with more than 70 tokens
seq_lengths = [len(x) for x in sequences]
over_idx = [i for i, l in enumerate(seq_lengths) if l > 70]

new_texts = []
new_sequences = []

# Keep only the abstracts and sequences with more than 70 tokens
for i in over_idx:
    new_texts.append(abstracts[i])
    new_sequences.append(sequences[i])

### Generate features and labels

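If `training_length` is 50, each run of 50 consecutive tokens becomes one feature and the token that immediately follows it becomes the label. The window slides forward one token at a time, so an abstract with `n` tokens yields `n - 50` training examples.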

In [0]:
features = []
labels = []

training_length = 50
# Iterate through the sequences of tokens
for seq in new_sequences:
    
    # Create multiple training examples from each sequence
    for i in range(training_length, len(seq)):
        # Extract the features and label
        extract = seq[i - training_length: i + 1]
        
        # Set the features and label
        features.append(extract[:-1])
        labels.append(extract[-1])

In [0]:
print("There are %d sequences." %(len(features)))

There are 293001 sequences.
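
The windowing loop is easier to follow on a tiny input. The sketch below wraps the same slicing logic in a helper (the name `make_windows` and the toy token list are assumptions for illustration) and uses a window of 3 instead of 50.

```python
def make_windows(seq, training_length):
    """Slide a window over seq; each window is a feature, the next token its label."""
    features, labels = [], []
    for i in range(training_length, len(seq)):
        extract = seq[i - training_length: i + 1]  # window plus the following token
        features.append(extract[:-1])
        labels.append(extract[-1])
    return features, labels

toy_seq = [10, 11, 12, 13, 14, 15]  # hypothetical token indices
toy_features, toy_labels = make_windows(toy_seq, training_length=3)

for f, l in zip(toy_features, toy_labels):
    print(f, '->', l)
```

A sequence of 6 tokens with a window of 3 gives 6 - 3 = 3 examples, which is the same `len(seq) - training_length` count that produces the total above.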


In [0]:
from sklearn.utils import shuffle
import numpy as np

features, labels = shuffle(features, labels, random_state=1)

# Decide on number of samples for training
train_end = int(0.7 * len(labels))

train_features = np.array(features[:train_end])
valid_features = np.array(features[train_end:])

train_labels = labels[:train_end]
valid_labels = labels[train_end:]

# Convert to arrays
X_train, X_valid = np.array(train_features), np.array(valid_features)

# Using int8 for memory savings
y_train = np.zeros((len(train_labels), num_words), dtype=np.int8)
y_valid = np.zeros((len(valid_labels), num_words), dtype=np.int8)

# One hot encoding of labels
for example_index, word_index in enumerate(train_labels):
    y_train[example_index, word_index] = 1

for example_index, word_index in enumerate(valid_labels):
    y_valid[example_index, word_index] = 1
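
As an aside, the one-hot loops can also be written with NumPy integer-array indexing. The sketch below compares both forms on toy labels (the label values and class count are made up) and estimates the size of the dense training label matrix from the sample count reported in the training log (205100) and `num_words = 16192`.

```python
import numpy as np

labels = [2, 0, 3]   # hypothetical toy word indices
num_classes = 5      # hypothetical toy vocabulary size

# Loop version, as in the cell above
y_loop = np.zeros((len(labels), num_classes), dtype=np.int8)
for example_index, word_index in enumerate(labels):
    y_loop[example_index, word_index] = 1

# Vectorized version: row i gets a 1 in column labels[i]
y_vec = np.zeros((len(labels), num_classes), dtype=np.int8)
y_vec[np.arange(len(labels)), labels] = 1

print(np.array_equal(y_loop, y_vec))

# Approximate memory for the int8 training labels at this notebook's scale
approx_gb = 205100 * 16192 / 1e9
print(approx_gb, "GB")
```

Even at one byte per entry the dense label matrix runs to a few gigabytes, which is why `int8` is used here.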

In [0]:
for i, sequence in enumerate(X_train[:2]):
    text = []
    for idx in sequence:
        text.append(idx_word[idx])
        
    print('Features: ' + ' '.join(text) + '\n')
    print('Label: ' + idx_word[np.argmax(y_train[i])] + '\n')


Features: creates a digital representation of the arc created during welding and using a neural network computer determines if the arc is representative of normal or abnormal welding conditions the neural network disclosed is trained to identify abnormal conditions and normal conditions and may be adaptively retrained to classify images that

Label: are

Features: data mining case based reasoning rule based reasoning fuzzy logic constraint programming and genetic algorithms the systems and methods of the present invention involve a fraud detection and prevention model that successfully detects and prevents electronic fraud and network intrusion in real time the model is not sensitive to known

Label: or



### Build Model

1. Embedding layer with dimension 100

2. One LSTM layer with 64 cells; dropout and recurrent dropout are listed for regularization, although the code cell below leaves both at their defaults

3. Fully connected layer with 64 units on top of the LSTM, with 'relu' activation

4. Dropout for regularization

5. Output layer that produces a prediction for each word, with 'softmax' activation

6. Adam optimizer with default settings

7. Categorical cross-entropy loss

8. Accuracy as the monitored metric

In [0]:
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Dropout, Embedding, Masking, Bidirectional
from keras.optimizers import Adam

In [0]:
model = Sequential()

# Embedding layer
model.add(
    Embedding(
        input_dim=len(word_idx) + 1,
        output_dim=100,
        weights=None,
        trainable=True))

# Recurrent layer
model.add(
    LSTM(
        64, return_sequences=False))

# Fully connected layer
model.add(Dense(64, activation='relu'))

# Dropout for regularization
model.add(Dropout(0.5))

# Output layer
model.add(Dense(num_words, activation='softmax'))

# Compile the model
model.compile(
    optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, None, 100)         1169500   
_________________________________________________________________
lstm_4 (LSTM)                (None, 64)                42240     
_________________________________________________________________
dense_7 (Dense)              (None, 64)                4160      
_________________________________________________________________
dropout_4 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 16192)             1052480   
Total params: 2,268,380
Trainable params: 2,268,380
Non-trainable params: 0
_________________________________________________________________
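
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers with the sizes from this notebook; note that the embedding uses `len(word_idx) + 1 = 11695` rows while the output layer uses `num_words = 16192` units.

```python
vocab_size = 11695   # len(word_idx) + 1, the Embedding input_dim
embed_dim = 100      # Embedding output_dim
lstm_units = 64
dense_units = 64
num_words = 16192    # size of the softmax output layer

# Embedding: one vector of embed_dim weights per vocabulary index
embedding = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense layers: weights plus biases
dense = lstm_units * dense_units + dense_units
output = dense_units * num_words + num_words

total = embedding + lstm + dense + output
print(embedding, lstm, dense, output, total)
```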



### Load in Pre-Trained Model
Rather than waiting several hours to train the model from scratch, we can load a model that was already trained for 150 epochs. The cell below then trains it for another 20 epochs, which should not take too long, depending on your hardware.

In [0]:
from keras.models import load_model

# Load in model and demonstrate training
model = load_model(project_path + 'train-embeddings-rnn.h5')
h = model.fit(X_train, y_train, epochs = 20, batch_size = 2048, 
          validation_data = (X_valid, y_valid), 
          verbose = 1)

Train on 205100 samples, validate on 87901 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


### Evaluate

In [0]:
print(model.evaluate(X_train, y_train, batch_size = 2048))
print('\nModel Performance: Log Loss and Accuracy on validation data')
print(model.evaluate(X_valid, y_valid, batch_size = 2048))

[5.122475255336603, 0.18782057532271426]

Model Performance: Log Loss and Accuracy on validation data
[5.424596187795825, 0.1827624257987444]
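
One way to read these loss values is to convert them to perplexity. Keras computes categorical cross-entropy with the natural logarithm, so perplexity is `exp(loss)`. The snippet below is a small sketch of that conversion using the numbers printed above.

```python
import math

train_loss = 5.122475255336603  # training loss from the evaluate output above
valid_loss = 5.424596187795825  # validation loss from the evaluate output above

# Perplexity: the effective number of equally likely words the model is choosing among
train_ppl = math.exp(train_loss)
valid_ppl = math.exp(valid_loss)
print(round(train_ppl), round(valid_ppl))
```

Both values are in the low hundreds, far below the 16192 output classes, so the model has narrowed the choice considerably even though its top-1 accuracy is only about 18%.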


### Generate Text 

Run this cell to inspect the text the model produces. It picks a random abstract, takes a random 50-word window from it as the seed, and then generates the next 50 words one at a time by sampling from the predicted word distribution.

In [0]:
seed_length=50
new_words=50
diversity=1
n_gen=1

import random

# Choose a random sequence from the filtered set (more than 70 tokens),
# so that the seed window below always fits
seq = random.choice(new_sequences)

# Choose a random starting point
seed_idx = random.randint(0, len(seq) - seed_length - 10)
# Ending index for seed
end_idx = seed_idx + seed_length

gen_list = []

for n in range(n_gen):
    # Extract the seed sequence
    seed = seq[seed_idx:end_idx]
    original_sequence = [idx_word[i] for i in seed]
    generated = seed[:] + ['#']

    # Find the actual entire sequence
    actual = generated[:] + seq[end_idx:end_idx + new_words]
        
    # Keep adding new words
    for i in range(new_words):

        # Make a prediction from the seed
        preds = model.predict(np.array(seed).reshape(1, -1))[0].astype(np.float64)

        # Diversify
        preds = np.log(preds) / diversity
        exp_preds = np.exp(preds)

        # Softmax
        preds = exp_preds / sum(exp_preds)

        # Choose the next word
        probas = np.random.multinomial(1, preds, 1)[0]

        next_idx = np.argmax(probas)

        # New seed adds on old word
        #             seed = seed[1:] + [next_idx]
        seed += [next_idx]
        generated.append(next_idx)
    # Showing generated and actual abstract
    n = []

    for i in generated:
        n.append(idx_word.get(i, '< --- >'))

    gen_list.append(n)

a = []

for i in actual:
    a.append(idx_word.get(i, '< --- >'))

a = a[seed_length:]

gen_list = [gen[seed_length:seed_length + len(a)] for gen in gen_list]

print(' '.join(original_sequence))
print("\n")
# print(gen_list)
print(' '.join(gen_list[0][1:]))
# print(a)

received for the person to person communication program the communication is processed to determine predefined language statements information on the determined language statements is inputted into a neural network to produce an output value a determination is made as to whether the output value indicates that the communication is unacceptable


differences of a signal circuit and providing convolution variables by each node of a predetermined control section based for a load strings for supplying and another database a target probability the overall process units for or dividing one or more characteristics corresponding to activation filtering values in the invention is
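
The "Diversify" step in the cell above rescales the predicted distribution before sampling. The sketch below isolates that step on a toy three-word distribution. The helper name `sample_with_diversity` and the toy probabilities are assumptions for illustration, and the subtraction of the maximum before exponentiating is an added numerical-stability tweak that the cell above does not use.

```python
import numpy as np

def sample_with_diversity(preds, diversity=1.0, rng=None):
    """Rescale a probability vector by 1/diversity in log space, then sample one index."""
    if rng is None:
        rng = np.random.default_rng()
    preds = np.asarray(preds, dtype=np.float64)
    logits = np.log(preds) / diversity
    exp_preds = np.exp(logits - logits.max())  # shift by max for stability (added here)
    probs = exp_preds / exp_preds.sum()        # renormalize to sum to 1
    draw = rng.multinomial(1, probs)           # one-hot draw from the distribution
    return int(np.argmax(draw)), probs

preds = [0.5, 0.3, 0.2]  # hypothetical model output over a 3-word vocabulary
rng = np.random.default_rng(0)
for diversity in (0.5, 1.0, 2.0):
    idx, probs = sample_with_diversity(preds, diversity, rng)
    print(diversity, np.round(probs, 3), idx)
```

A diversity below 1 sharpens the distribution toward the most likely word, a diversity of 1 leaves it unchanged, and a diversity above 1 flattens it, which makes the generated text more varied but less coherent.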
