### Approach  
1. Get some random texts from eviction notices (10-30)
2. Augument these random texts to 10000 using text augmentation techniques
3. Randomly generate fake addresses and insert them into specific and random positions in the 10,000 texts
4. At the same time assign the generated fake addresses as a label to that text
5. Recurrent Neural networks to extract addresses
6. User Address parser to extract and label address entities

Submission
- Report
- Code Documentation
- Functions input and Output
- Usage
- Limitations
- size of input text, output text - pages, words, rows

Update  
- Finshed trainng data creation in Fall 2019
- Modelling part in Winter & Spring 2020 

### Tasks 3/6
- Create Labelled Data - Assign flags to addresses

In [1]:
import pandas as pd
import numpy as np
import usaddress
import pytesseract
import cv2
import glob
import faker
import nltk
import os
import random
try:
    from PIL import Image
except ImportError:
    import Image
import spacy
from py_thesaurus import Thesaurus
from nltk.corpus import wordnet 
import en_core_web_sm
import re
from nlp_aug import *
from nltk.corpus import stopwords
nltk.download('wordnet')

import tensorflow.keras
import tensorflow as tf 
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, Dropout, Activation, Input
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
import tensorflow.keras.layers
from tensorflow.keras.layers import LSTM, Dropout
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import RepeatVector
from tensorflow.keras.layers import TimeDistributed
from tensorflow.keras.models import load_model

import numpy as np
import matplotlib.pyplot as plt

from collections import defaultdict
import ast

from tensorflow.keras.models import model_from_json

In [13]:
import tensorflow as tf

### Libraries
#### Parse Addresses  
For International addresses: 'https://github.com/openvenues/libpostal'  
For USA Addresses: https://github.com/datamade/usaddress
#### OCR
https://pypi.org/project/pytesseract/
#### Supporting Libraries
cv2, image/PIL.image


### Text Augmentation Techniques  
- Synonym Replacement (SR): Randomly replace n words in the sentences with their synonyms.
- Random Insertion (RI): Insert random synonyms of words in a sentence, this is done n times.
- Random Swap (RS): Two words in the sentences are randomly swapped, this is repeated n-times.
- Random Deletion (RD): Random removal for each word in the sentence with a probability p.  
The formula used to determine the number of sentences augmented is:  
N = Alpha * Length of the sentence.  
Alpha is the “augmentation parameter”, higher the alpha-more aggressive the “EDA”.  (Easy Data Augmentation)
  
Functions: https://www.kaggle.com/init927/nlp-data-augmentation
Paper: https://arxiv.org/pdf/1901.11196.pdf

### Spell Checker
https://rustyonrampage.github.io/text-mining/2017/11/28/spelling-correction-with-python-and-nltk.html

### Approach
1. Creation of Synthetic Data
        1. Gather Text
        2. Convert Text into Documents
        3. Add placeholders to the Text in decided and random locations
            - %%ADDRESS%%, %%NAME%%, %%DATE%%, %%EMAIL%%, %%PHONE%%
        4. Augment the Text along without touching the placeholders
        5. Utility to replace the placeholders with corresponding randomly generated texts
        6. Create a labeled dataset of (Documents, Addresses)


In [None]:
fake=faker.Faker()

In [None]:
img_file_list=glob.glob(r"synthetic-data\*")
print(img_file_list)
img = cv2.imread(r'synthetic-data\image (3).jpg')
print(pytesseract.image_to_string(img))
# OR explicit beforehand converting
#print(pytesseract.image_to_string(Image.fromarray(img))

## Data Curation

In [None]:
def read_extract(file_path=r".\synthetic-data\extract.txt"):
    extract=open(file_path,'r',encoding='utf-8')
    extracted=extract.read()
    extract.close()
    return extracted

def fix_extract(extract, threshold=4):
    sentences=nltk.sent_tokenize(extract)
    filtered_sentences=[]
    for sentence in sentences:
        if len(nltk.word_tokenize(sentence))>threshold:
            filtered_sentences.append(sentence)
    return " ".join(filtered_sentences)
        
def chunk_text(large_chunk,num=10):
    sentences=nltk.sent_tokenize(large_chunk)
    size=len(sentences)
    mini_chunks=[]
    start=0
    end=num
    for i in range(size//num):
        mini_chunks.append(" ".join(sentences[start:end]))
        start=start+num
        end=end+num
    return mini_chunks

def read_samples(file_path=r".\synthetic-data\text\*"):
    file_list = glob.glob(file_path)
    corpus = []
    for file_path in file_list:
        with open(file_path) as f_input:
            corpus.append(f_input.read())
    return corpus

def insert_placeholder(string, index, placeholder,prob_newline=[0.3,0.3,0.3]):
    replace_string=placeholder
    if np.random.uniform(0,1)<prob_newline[0]:
        replace_string=replace_string.replace('\n','')
    if np.random.uniform(0,1)<prob_newline[1]:
        replace_string='\n'+replace_string
    else:
        replace_string=' '+replace_string
    if np.random.uniform(0,1)<prob_newline[2]:
        replace_string=replace_string+'\n'
    else:
        replace_string=replace_string+' '
    return string[:index] + replace_string + string[index:]

def randomly_assign_placeholder(mini_chunk,probabilities=[0.8,0.4,0.7], placeholder=r'%%ADDRESS%%'):
    chunk_size=len(mini_chunk)
    if np.random.uniform(0,1)<probabilities[0]:
        index=random.randint(0,chunk_size//3)
        index=index+mini_chunk[index:].find(' ')
        mini_chunk=insert_placeholder(mini_chunk,index,placeholder)
    chunk_size=len(mini_chunk)
    if np.random.uniform(0,1)<probabilities[1]:
        index=random.randint(chunk_size//3,2*chunk_size//3)
        index=index+mini_chunk[index:].find(' ')
        mini_chunk=insert_placeholder(mini_chunk,index,placeholder)
    chunk_size=len(mini_chunk)
    if np.random.uniform(0,1)<probabilities[0]:
        index=random.randint(2*chunk_size//3,chunk_size)
        index=index+mini_chunk[index:].rfind(' ')
        mini_chunk=insert_placeholder(mini_chunk,index,placeholder)
    return mini_chunk

def randomly_assign_placeholder_to_list(mini_chunks,probabilities=[0.8,0.4,0.7],placeholder=r'%%ADDRESS%%'):
    processed_chunks=[]
    for chunk in mini_chunks:
        new_chunk=randomly_assign_placeholder(chunk,probabilities,placeholder)
        processed_chunks.append(new_chunk)
    return processed_chunks

In [None]:
extracted=fix_extract(read_extract())

In [None]:
mini_chunks=chunk_text(extracted,10)

In [None]:
mini_chunks[0]

### TEXT AUGMENTATION

1. Synonym Replacement

In [None]:
def synalter_Noun_Verb(word,al,POS):
    max_temp = -1
    flag = 0
    for i in a1:
        try:
            w1 = wordnet.synset(word+'.'+POS+'.01') 
            w2 = wordnet.synset(i+'.'+POS+'.01') # n denotes noun 
            if(max_temp<w1.wup_similarity(w2)):
                max_temp=w1.wup_similarity(w2)
                temp_name = i
                flag =1
        except:
            f = 0
            
    if flag == 0:
        max1 = -1.
        nlp = en_core_web_sm.load()
        for i in a1:
            j=i.replace(' ', '')
            tokens = nlp(u''+j)
            token_main = nlp(u''+word_str)
            for token1 in token_main:
                if max1<float(token1.similarity(tokens)):
                    max1 = token1.similarity(tokens)
                    value = i
        max1 = -1.
        return value 
    else:
        return temp_name


In [None]:
def replace_synonyms(chunks,percent=50):
    output_chunks=[]
    for chunk in chunks:
        output_chunk = chunk
        words = chunk.split()
        counts = {}
        for word in words:
            if word not in counts:
                counts[word] = 0
            counts[word] += 1
        one_word = []
        for key, value in counts.items():
            if value == 1 and key.isalpha() and len(key)>2:
                one_word.append(key)
        noun = []
        verb = []
        nlp = en_core_web_sm.load()
        doc = nlp(u''+' '.join(one_word))
        for token in doc:
            if  token.pos_ == 'VERB':
                verb.append(token.text)
            if  token.pos_ == 'NOUN':
                noun.append(token.text)
        all_main =verb + noun
        len_all = len(noun)+len(verb)
        final_value = int(len_all * percent /100)
        #random.seed(4)
        temp = random.sample(range(0, len_all), final_value)
        for i in temp:
            try:
                word_str = all_main[i]
                w = Word(word_str)
                a1= list(w.synonyms())
                if i<len(verb):
                    change_word=synalter_Noun_Verb(word_str,a1,'v')
                    try:
                        search_word = re.search(r'\b('+word_str+r')\b', output_chunk)
                        Loc = search_word.start()
                        output_chunk = output_chunk[:int(Loc)] + change_word + output_chunk[int(Loc) + len(word_str):]
                    except:
                        f=0

                else:
                    change_word=synalter_Noun_Verb(word_str,a1,'n')
                    try:
                        search_word = re.search(r'\b('+word_str+r')\b', output_chunk)
                        Loc = search_word.start()
                        output_chunk = output_chunk[:int(Loc)] + change_word + output_chunk[int(Loc) + len(word_str):]
                    except:
                        f=0
            except:
                f=0
        output_chunks.append(output_chunk)
    return output_chunks

### Text Augment Functions built upon nlp_aug

In [None]:
def augment_chunk(mini_chunk, probability=0.5):
    sentences=nltk.sent_tokenize(mini_chunk)
    new_sentences=[]
    for sentence in sentences:      
        if random.uniform(0,1)<=probability:
            new_sentence=eda_4(sentences[0], num_aug=1)[0]+'.'
            new_sentences.append(new_sentence)
        else:
            new_sentences.append(sentence)
    return " ".join(new_sentences)

def augment_list_of_chunks(mini_chunks,probability=0.5, num_aug=100):
    augmented_chunks=[]
    num=0
    for i in range(num_aug):
        for chunk in mini_chunks:
            try:
                num=num+1
                augmented_chunks.append(augment_chunk(chunk,probability)) 
            except:
                print("Error in Chunk List: Chunk#" +str(num-1))
                return None
    augmented_chunks.extend(mini_chunks)
    return augmented_chunks

### Augment chunks derived from the text extract

In [None]:
ag_chunks=augment_list_of_chunks(mini_chunks, num_aug=2)

### Assignment Address placeholders to the augmented chunks derived from the text extract

In [None]:
processed_chunks=randomly_assign_placeholder_to_list(ag_chunks)

### Get Samples and Augment Them

In [None]:
sample_notices=read_samples()
aug_sample_notices=augment_list_of_chunks(sample_notices, num_aug=50)

### Combine the two datasets

In [None]:
synthetic_data=processed_chunks+aug_sample_notices

### Replace %%ADDRESS%%  placeholders with random fake addresses

In [None]:
def count_occurences(source_string, substring):
    return re.subn(substring, '', source_string)[1]   

https://stackoverflow.com/questions/27589325/how-to-find-and-replace-nth-occurrence-of-word-in-a-sentence-using-python-regula

In [None]:
def replace_ith_instance(string, pattern, new_str, i = None, pattern_flags = 0):
    # If i is None - replacing last occurrence
    match_obj = re.finditer(r'{0}'.format(pattern), string, flags = pattern_flags)
    matches = [item for item in match_obj]
    if i == None:
        i = len(matches)
    if len(matches) == 0 or len(matches) < i:
        return string
    match = matches[i - 1]
    match_start_index = match.start()
    match_len = len(match.group())
    return '{0}{1}{2}'.format(string[0:match_start_index], new_str, string[match_start_index + match_len:])

In [None]:
def replace_placeholders(chunk,placeholder='%%ADDRESS%%', newline_flag=False):
    fake=faker.Faker()
    num_addr=count_occurences(chunk,placeholder)
    addresses=[]
    for i in range(num_addr):
        fake_addr=fake.address()
        chunk=replace_ith_instance(chunk,placeholder,fake_addr,1)
        addresses.append(fake_addr.replace('\n',' '))
    if newline_flag==False:
        chunk=chunk.replace('\n',' ')
    return (chunk, tuple(addresses))

def get_labeled_data(chunk_list):
    labeled_data=[]
    for chunk in chunk_list:
        labeled_data.append(replace_placeholders(chunk))
    return labeled_data

In [None]:
labeled_data=get_labeled_data(synthetic_data)

In [None]:
#test
print(labeled_data[5][0])
print(synthetic_data[5])

### Modelling
(WIP and not good documentation)

https://towardsdatascience.com/addressnet-how-to-build-a-robust-street-address-parser-using-a-recurrent-neural-network-518d97b9aebd  
https://www.tensorflow.org/tutorials/text/text_classification_rnn  
https://keras.io/examples/lstm_seq2seq/  

In [None]:
labeled_data_df=pd.DataFrame(labeled_data,columns=['Notice','Addresses'])

In [5]:
labeled_data_df.to_csv('.\labeled_data.txt',sep='|',index=False)

NameError: name 'labeled_data_df' is not defined

In [11]:
labeled_data_df=pd.read_csv('labeled_data.txt',sep='|')
labeled_data_df[['Addresses']]=pd.DataFrame(labeled_data_df['Addresses'].apply(lambda t:ast.literal_eval(t)))

In [122]:
labeled_data_df.iloc[0]['Addresses']

('PSC 8763, Box 5203 APO AE 49080',)

In [121]:
labeled_data_df.iloc[0]['Notice']

'An Eviction Notice, also known as a Notice to Quit, is a document sent by a Landlord to a Tenant to inform them of a violation or termination of the lease agreement and to start the process of removing a Tenant from the property. an eviction also to quit PSC 8763, Box 5203 APO AE 49080 a document sent by a landlord to a tenant to inform them of a violation or termination of the lease agreement to start the process removing a tenant from the property. However, this eviction process generally begins with the Landlord providing the Tenant with a written Eviction Notice. an eviction notice also to as a notice to quit a a document them by a the to a tenant known inform sent of a violation or termination of landlord lease agreement and to start the process of removing is tenant from the property. The Eviction Notice serves to make the Tenant aware that they have not complied with the terms of the lease or are otherwise subject to being evicted and gives the Tenant a deadline by which they m

In [12]:
labeled_data_df[['Notice']]=pd.DataFrame(labeled_data_df['Notice'].apply(lambda t:t.replace('\n',' ')))
labeled_data_df[['Notice']]=pd.DataFrame(labeled_data_df['Notice'].apply(lambda t:' '.join(t.split())))

In [13]:
print(labeled_data_df['Notice'].iloc[0])

An Eviction Notice, also known as a Notice to Quit, is a document sent by a Landlord to a Tenant to inform them of a violation or termination of the lease agreement and to start the process of removing a Tenant from the property. an eviction also to quit PSC 8763, Box 5203 APO AE 49080 a document sent by a landlord to a tenant to inform them of a violation or termination of the lease agreement to start the process removing a tenant from the property. However, this eviction process generally begins with the Landlord providing the Tenant with a written Eviction Notice. an eviction notice also to as a notice to quit a a document them by a the to a tenant known inform sent of a violation or termination of landlord lease agreement and to start the process of removing is tenant from the property. The Eviction Notice serves to make the Tenant aware that they have not complied with the terms of the lease or are otherwise subject to being evicted and gives the Tenant a deadline by which they mu

#### Feature Enegineering
- I am trying out 1-gram text sequences to start with
- It is easy to encode text-sequences into 1-gram
- There are limited number of features and the number of feature remains constant
- I am interested in alpha-numeric characters and spcecial characters like ',', '.', ' '

#### Create feature dictionary

In [6]:
## Add hash and dash sign as well
features='1234567890,.abcdefghijklmnopqrstuvwxyz '
feature_dict=defaultdict(int)
count=0
for f in features:
    count+=1
    feature_dict[f] += count # increment element's value by 1

inv_feature_dict = {v: k for k, v in feature_dict.items()}

#### Encode features


In [18]:
def encode_features(text, feature_dict=feature_dict):
    code=[]
    text=text.lower()
    for charac in text:
        code.append(feature_dict[charac])
    code=np.array(code)
    return code

#### Decode Features

In [60]:
def decode_features(seq, mapping=inv_feature_dict):
    seq_d=[]
    for num in seq:
        n=int(np.round(num[0]))
        seq_d.append(mapping[n])
    return ''.join(seq_d)

#### Erroneous Archived Function
def mask_notices(notice, addresses):
    len_addr=len(addresses)
    indices=[]
    if len_addr==0:
        return '_'*len(notice)
    else:
        start_index=0
        ret_string=''
        for addr in addresses:
            addr_ind=notice[start_index:].index(addr)
            indices.append(addr_ind)
            start_index=len(addr)+addr_ind
        for i in range(len_addr):
            ret_string=ret_string+'_'*indices[i]+addresses[i]
        delta=len(notice)-len(ret_string)
        if delta<0:
            print('Issue with the data. Output Sequence > Input')
        return ret_string+'_'*delta

In [16]:
def mask_notices(notice, addresses):
    len_addr=len(addresses)
    len_not=len(notice)
    indices=[]
    if len_addr==0:
        return '_'*len_not
    else:
        bool_vec=np.zeros(len_not, dtype=np.bool)
        for addr in addresses:
            start_ind=notice.index(addr)
            end_ind=start_ind+len(addr)
            bool_vec[start_ind:end_ind]=True
        masked_notice=''
        for c,b in zip(notice, bool_vec):
            if b:
                masked_notice=masked_notice+c
            else:
                masked_notice=masked_notice+'_'
        return masked_notice

In [15]:
MAX_SEQ_LENGTH=max(labeled_data_df['Notice'].apply(len))

In [26]:
MAX_SEQ_LENGTH=8100

8038

In [17]:
def pad_text(text_seq,max_len=MAX_SEQ_LENGTH):
    padding=MAX_SEQ_LENGTH-len(text_seq)
    return text_seq+padding*'_'

In [21]:
#def getXY(labeled_data_df=labeled_data_df):
subset=labeled_data_df[:10]

In [22]:
s=subset['Notice'].apply(lambda l: pad_text(l) )

In [133]:
X_t=subset['Notice'] \
      .apply(lambda l: pad_text(l)) \
      .apply(lambda l: encode_features(l)) \
      .apply(lambda l: np.array(list(map(lambda m:np.float(m), l))))

In [134]:
y_t=subset \
      .apply(lambda l: mask_notices(l['Notice'],l['Addresses']), axis=1) \
      .apply(lambda l: pad_text(str(l))) \
      .apply(lambda l: encode_features(str(l))) \
      .apply(lambda l: np.array(list(map(lambda m:np.float(m), l))))

In [140]:
print(y_t[0][200:300])

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0. 28. 31. 15. 39.  8.  7.  6.  3. 11. 39. 14. 27. 36. 39.  5.  2. 10.
  3. 39. 13. 28. 27. 39. 13. 17. 39.  4.  9. 10.  8. 10.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]


In [139]:
X_t[0][200:300]

array([19., 39., 13., 39., 32., 17., 26., 13., 26., 32., 39., 18., 30.,
       27., 25., 39., 32., 20., 17., 39., 28., 30., 27., 28., 17., 30.,
       32., 37., 12., 39., 13., 26., 39., 17., 34., 21., 15., 32., 21.,
       27., 26., 39., 13., 24., 31., 27., 39., 32., 27., 39., 29., 33.,
       21., 32., 39., 28., 31., 15., 39.,  8.,  7.,  6.,  3., 11., 39.,
       14., 27., 36., 39.,  5.,  2., 10.,  3., 39., 13., 28., 27., 39.,
       13., 17., 39.,  4.,  9., 10.,  8., 10., 39., 13., 39., 16., 27.,
       15., 33., 25., 17., 26., 32., 39., 31., 17.])

In [20]:
X_raw=labeled_data_df['Notice'] \
      .apply(lambda l: pad_text(l)) \
      .apply(lambda l: encode_features(l)) \
      .apply(lambda l: np.array(list(map(lambda m:np.float(m), l))))
               
Y_raw=labeled_data_df \
      .apply(lambda l: mask_notices(l['Notice'],l['Addresses']), axis=1) \
      .apply(lambda l: pad_text(str(l))) \
      .apply(lambda l: encode_features(str(l))) \
      .apply(lambda l: np.array(list(map(lambda m:np.float(m), l))))

KeyboardInterrupt: 

In [16]:
X_raw=np.concatenate(X_raw)#.reshape(1443,8038)
Y_raw=np.concatenate(Y_raw)#.reshape(1443,8038)

In [26]:
0 1 2 3 4 5 
1 2 3 4 5 6
2 3 4 5 6 7

0    [13.0, 26.0, 39.0, 17.0, 34.0, 21.0, 15.0, 32....
1    [32.0, 20.0, 21.0, 31.0, 39.0, 30.0, 17.0, 15....
2    [13.0, 32.0, 39.0, 32.0, 20.0, 17.0, 39.0, 17....
3    [15.0, 33.0, 30.0, 17.0, 39.0, 27.0, 30.0, 39....
4    [21.0, 18.0, 39.0, 37.0, 27.0, 33.0, 39.0, 31....
5    [18.0, 27.0, 30.0, 39.0, 25.0, 27.0, 30.0, 17....
6    [32.0, 20.0, 17.0, 39.0, 28.0, 30.0, 21.0, 15....
7    [25.0, 27.0, 31.0, 32.0, 39.0, 27.0, 26.0, 24....
8    [25.0, 21.0, 15.0, 30.0, 27.0, 31.0, 27.0, 18....
9    [18.0, 27.0, 30.0, 39.0, 28.0, 33.0, 30.0, 28....
Name: Notice, dtype: object

In [42]:
def temporalize(X,y,lookback):
    Xt=[]
    yt=[]
    for i in range(len(X)-lookback+1):
        Xt.append(X[i:i+lookback])
        yt.append(y[i:i+lookback])
    return Xt,yt

In [37]:
def temporalize(X, y, lookback):
    output_X = []
    output_y = []
    for i in range(len(X)-lookback-1):
        t = []
        u = []
        for j in range(1,lookback+1):
            t.append(np.array(X[[(i+j+1)]]))
            u.append(np.array(y[[(i+j+1)]]))
        output_X.append(t)
        output_y.append(u)
        #output_y.append(y[i+lookback+1])
    return output_X, output_y

In [18]:
timesteps = 30
n_features = 1

X, Y = temporalize(X = X_raw, y = Y_raw, lookback = timesteps)

X = np.array(X)
X = X.reshape(X.shape[0], timesteps, n_features)

Y = np.array(Y)
Y = Y.reshape(Y.shape[0], timesteps, n_features)

In [21]:
import pickle

In [23]:
fileX = open('tempo_X.pickle', 'ab')
pickle.dump(X,fileX)
fileY = open('tempo_Y.pickle', 'ab')
pickle.dump(Y,fileY)

In [24]:
fileX.close()
fileY.close()

In [None]:
def show_shapes(Sequences, Targets): # can make yours to take inputs; this'll use local variable values
    print("Expected: (num_samples, timesteps, channels)")
    print("Sequences: {}".format(Sequences.shape))
    print("Targets:   {}".format(Targets.shape))   

In [13]:
len(X)

11598805

#### Sequence to sequence LSTMS

In [36]:
model = Sequential()
model.add(LSTM(256, activation='relu', input_shape=(timesteps,n_features), return_sequences=True))
model.add(LSTM(144, activation='relu', return_sequences=True))
model.add(Dropout(0.4))
model.add(LSTM(144, activation='relu', return_sequences=True))
model.add(Dropout(0.4))
model.add(LSTM(128, activation='relu', return_sequences=False))
model.add(RepeatVector(timesteps))
model.add(LSTM(128, activation='relu', return_sequences=True))
model.add(Dropout(0.4))
model.add(LSTM(108, activation='relu', return_sequences=True))
model.add(Dropout(0.4))
model.add(LSTM(108, activation='relu', return_sequences=True))
model.add(Dropout(0.4))
model.add(LSTM(64, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(n_features)))

model.compile(optimizer='adam', loss='mse')
model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_18 (LSTM)               (None, 30, 256)           264192    
_________________________________________________________________
lstm_19 (LSTM)               (None, 30, 144)           230976    
_________________________________________________________________
dropout_5 (Dropout)          (None, 30, 144)           0         
_________________________________________________________________
lstm_20 (LSTM)               (None, 30, 144)           166464    
_________________________________________________________________
dropout_6 (Dropout)          (None, 30, 144)           0         
_________________________________________________________________
lstm_21 (LSTM)               (None, 128)               139776    
_________________________________________________________________
repeat_vector_1 (RepeatVecto (None, 30, 128)          

In [1]:
import pickle as pk

In [38]:
model.fit(X, Y, epochs=5, batch_size=2500, verbose=1)


Train on 11598805 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fdb601fdcf8>

In [39]:
model.save('mymodel.model')

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: mymodel.model/assets


In [42]:
from tensorflow.keras.models import save_model

In [45]:
save_model(model,'LSTM_TextExtract')

INFO:tensorflow:Assets written to: LSTM_TextExtract/assets


In [None]:
inp=Input(shape=(timesteps,n_features))
inp_1=LSTM(128, activation='relu', input_shape=(timesteps,n_features), return_sequences=True)(inp)
inp_dp_1=Dropout(0.2)(inp_1)
inp_2=LSTM(3, activation='relu', return_sequences=False)(inp_dp_1)
rep=RepeatVector(timesteps)(inp_2)
op_2=LSTM(3, activation='relu', return_sequences=True)(rep)
op_dp_1=Dropout(0.2)(op_2)
op_1=LSTM(128, activation='relu', return_sequences=True)(op_dp_1)
op=TimeDistributed(Dense(n_features))(op_1)

autoencoder=Model(inp,op)
encoder=Model(inp,rep)

In [None]:
autoencoder.compile(optimizer='adamax', loss='mse')
autoencoder.fit(X, X, epochs=100, batch_size=200, verbose=1)

In [47]:
# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")

Saved model to disk


In [2]:

# later...
 
# load json and create model
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model.h5")
print("Loaded model from disk")

Loaded model from disk


In [45]:
timesteps = 30
n_features = 1
X_tr=np.concatenate(X_t)
X_test, y = temporalize(X = X_tr, y = X_tr, lookback = timesteps)


In [51]:
X_test = np.array(X_test)
X_test = X_test.reshape(X_test.shape[0], timesteps, n_features)


In [114]:
yhat = loaded_model.predict(X_test[5000:20000], verbose=1)



In [67]:
subset['Notice'][0].index('PSC 8763')

255