<a href="https://colab.research.google.com/github/cspadine/ListingGenerator/blob/main/ListingGeneratorVersion1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Listing Generator, Version 1

This is the first step in an ongoing project. The eventual goal is a model that generates a description of a home for use in that home's sales listing.

As a first step, I'm following this tutorial (https://www.kaggle.com/shivamb/beginners-guide-to-text-generation-using-lstms) to build a basic word-level LSTM network that generates text from a small sample of listings.  Subsequent work will build on this toy model to create the actual model.

Future improvements (listed here for later reference, in no particular order):


1.   Add a data generator for the model. This tutorial will be useful: https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
2.   Experiment with cleaning the data in different ways. I don't want individual phone numbers, for example, treated as tokens, because each distinct number becomes its own vocabulary entry and inflates the number of features.
3.   Relatedly, use a word2vec model for vectorization.
4.   Also relatedly, think more about how to handle named entities, such as specific bodies of water referenced in a listing. In "This house sits on the shore of Lake Whatever", including "Lake Whatever" probably isn't going to generalize well.
5.   Incorporate other features from the listings, like the number of bedrooms or whether the property has a pool. My plan is to feed these features in along with the word vectors and see if that steers the output toward a specific property. To gauge how well this works, though, I'll need relatively cogent text coming out of the model. If the output is gibberish, it is hard to tell whether it better describes a multi-family Victorian-style house or a one-story bungalow.
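
Here is a rough sketch of what the data generator from item 1 might look like. It is plain Python rather than the `keras.utils.Sequence` subclass the linked tutorial builds, and the names (`batch_generator`, `toy_sequences`) and the toy token indices are made up for illustration. The point is only that padding and one-hot encoding can happen one batch at a time instead of for the whole corpus up front.

```python
from typing import Iterator, List, Tuple

def batch_generator(
    sequences: List[List[int]],
    max_len: int,
    vocab_size: int,
    batch_size: int = 32,
) -> Iterator[Tuple[List[List[int]], List[List[int]]]]:
    """Yield (predictors, one_hot_labels) one batch at a time.

    Each sequence is left-padded with zeros to max_len. The last token is
    the label; everything before it is the predictor.
    """
    for start in range(0, len(sequences), batch_size):
        predictors, labels = [], []
        for seq in sequences[start:start + batch_size]:
            seq = seq[-max_len:]                        # keep the most recent tokens
            padded = [0] * (max_len - len(seq)) + seq   # 'pre' padding
            predictors.append(padded[:-1])
            one_hot = [0] * vocab_size
            one_hot[padded[-1]] = 1                     # one-hot only for this batch
            labels.append(one_hot)
        yield predictors, labels

# Hypothetical token-index sequences, standing in for the tokenizer output
toy_sequences = [[5, 2], [5, 2, 7], [5, 2, 7, 1], [3, 4], [3, 4, 6]]
batches = list(batch_generator(toy_sequences, max_len=4, vocab_size=8, batch_size=2))
first_predictors, first_labels = batches[0]
print(first_predictors)  # [[0, 0, 5], [0, 5, 2]]
print(first_labels[0])   # [0, 0, 1, 0, 0, 0, 0, 0]
```

To feed something like this to Keras, the lists would still need to be converted to NumPy arrays and the generator would need to repeat across epochs, so treat this as the shape of the idea rather than a drop-in replacement.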



In [1]:
!pip install tensorflow
!pip install keras



Importing the packages you need for the model:

In [23]:
# keras modules for building the LSTM
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.models import Sequential
from tensorflow.keras.utils import to_categorical

# set seeds for reproducibility (disabled for now)
# note: TF 2.x uses tf.random.set_seed in place of the old set_random_seed
#import tensorflow as tf
#from numpy.random import seed
#tf.random.set_seed(2)
#seed(1)

import pandas as pd
import numpy as np
import string, os 

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

###Reading in the text

To test out the LSTM model, I'm just going to load a fairly small sample of the listings I have available:

In [3]:
with open("/content/drive/MyDrive/Listings/sample_text7.txt","r") as file:
  text = file.read()

listings = text.split("\n")

In [4]:
for n in range(0,10):
  print(listings[n])

Investors take notice! Great investment!!! 3 family property with a 2 family, side by side, 3 bedrooms each unit and 1277 sqft per unit and a single family home with 4 bedrooms and 1224 sqft. Gas heat, newer boilers. Electrical has been updated.  Income $2900 per month. Needing  TLC. Will not pass FHA.  Taxes will be +-6000 when the property is divided
Welcome to Silvermine.  Conveniently located; although situated on a private lane.  Beautifully manicured and landscaped level lot.   The updated kitchen with propane cooking is ready for gatherings.  Recently paved driveway and new siding.  During the warmer months, enjoy the spacious deck, large yard for entertaining and during the cooler months retreat to the warmth of the fireplace.  A finished upper level that features a half bath and laundry, easily convert to a master suite.  A Must See!     All new windows & screens   New vinyl siding with added insulation  Insulation between walls & in crawl space  Brand new deck (2019)  Newly p

###Cleaning the text

Right now, I'm just going to case-normalize the data before it gets tokenized and vectorized.  I'll probably want to do more cleaning, but I'll hold off until the model is at least up and running, because some of the cleaning decisions aren't obvious.

In [5]:
# Lowercase the text, strip punctuation, and drop non-ASCII characters
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

#corpus = [clean_text(x) for x in all_headlines]
#corpus[:10]

There are a few things I'm going to want to remove here.  The biggest one is phone numbers. I'll come back to it.
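
For when I do come back to it, here is a minimal sketch of masking phone numbers with a regular expression. The pattern is an assumption about what the numbers look like (ten-digit US-style numbers with optional parentheses, dashes, dots, or spaces), and `mask_phone_numbers` and the `phonenumber` placeholder are names invented here. The placeholder is a plain word on purpose, since the Keras `Tokenizer` strips characters like `<` and `>` by default.

```python
import re

# Assumed formats: 203-555-0142, (203) 555-0142, 203.555.0142, 203 555 0142
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")

def mask_phone_numbers(txt: str, placeholder: str = "phonenumber") -> str:
    """Replace every phone-number-like span with a single placeholder token."""
    return PHONE_RE.sub(placeholder, txt)

sample = "Call Pat at (203) 555-0142 or 203.555.0199 for a showing. 1277 sqft per unit."
print(mask_phone_numbers(sample))
# Call Pat at phonenumber or phonenumber for a showing. 1277 sqft per unit.
```

Shorter numbers such as square footage or prices are left alone, because the pattern requires exactly ten digits.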

Also for the future: remove place names? For example, "west hartford" should only show up in the output if the property is actually in West Hartford.
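
One possible way to do that, sketched under the assumption that I have a list of place names to look for. The list below is a hypothetical stand-in, and `mask_places` and the `placename` placeholder are invented for this example. A real version would probably need a proper gazetteer or a named-entity tagger.

```python
import re

# Hypothetical list of place names; a real one would come from the listing metadata
KNOWN_PLACES = ["west hartford", "stamford", "silvermine"]

def mask_places(txt, places=KNOWN_PLACES, placeholder="placename"):
    """Lowercase the text and replace each known place name with a placeholder."""
    # Longest names first, so multi-word names win over shorter overlapping ones
    ordered = sorted(places, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b")
    return pattern.sub(placeholder, txt.lower())

print(mask_places("Welcome to Silvermine, minutes from downtown Stamford."))
# welcome to placename, minutes from downtown placename.
```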

In [38]:

# Redefine clean_text to keep punctuation for now: lowercase and drop non-ASCII characters
def clean_text(txt):
    txt = txt.lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

corpus = [clean_text(x) for x in listings]


###Tokenization

For now, I'll use the Keras Tokenizer, which maps each word to an integer index.  Eventually we'll want some kind of word2vec model.

In [7]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]

[[654, 328],
 [654, 328, 771],
 [654, 328, 771, 50],
 [654, 328, 771, 50, 470],
 [654, 328, 771, 50, 470, 33],
 [654, 328, 771, 50, 470, 33, 32],
 [654, 328, 771, 50, 470, 33, 32, 58],
 [654, 328, 771, 50, 470, 33, 32, 58, 4],
 [654, 328, 771, 50, 470, 33, 32, 58, 4, 3],
 [654, 328, 771, 50, 470, 33, 32, 58, 4, 3, 17]]
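
The index lists above are easier to read with words in place of numbers. This small sketch mirrors the `token_list[:i+1]` loop on a whitespace-split sentence. It skips the Keras `Tokenizer` entirely, and `prefix_sequences` is a name made up for the example.

```python
def prefix_sequences(tokens):
    """Return every prefix of length >= 2, like token_list[:i+1] for i >= 1."""
    return [tokens[:i + 1] for i in range(1, len(tokens))]

tokens = "investors take notice great investment".split()
for seq in prefix_sequences(tokens):
    *context, target = seq          # the last token is what the model must predict
    print(context, "->", target)
```

Each prefix becomes one training example, so a listing with n tokens contributes n - 1 examples.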

###Padding sentences

In [10]:
#Padding sentences
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
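
A back-of-the-envelope estimate of how much memory these two arrays take, which is the main motivation for the data generator idea. The numbers come from later in this notebook (vocabulary size and input length from the model summary, steps per epoch from the training progress bar). The batch size of 32 is an assumption: it is the Keras default, and `fit` is called without a `batch_size`.

```python
steps_per_epoch = 4247   # from the training progress bar
batch_size = 32          # assumption: Keras default batch size
vocab_size = 6892        # width of the Dense output layer in the model summary
context_len = 813        # Embedding input length in the model summary

# Upper bound on the number of training sequences (the last batch may be partial)
n_sequences = steps_per_epoch * batch_size

label_bytes = n_sequences * vocab_size * 4       # float32 one-hot rows from to_categorical
predictor_bytes = n_sequences * context_len * 4  # int32 padded indices from pad_sequences

print(f"label matrix:     {label_bytes / 2**30:.2f} GiB")
print(f"predictor matrix: {predictor_bytes / 2**30:.2f} GiB")
```

On these assumptions the one-hot label matrix alone comes to roughly 3.5 GiB, and it grows with both the number of listings and the vocabulary size.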

###Defining the model

One thing I'm noticing is that this model takes a long time to train, but isn't saving the weights anywhere.  We're going to want to save those weights to use later, or in case anything goes wrong. 

In [39]:
#defining the keras model
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 813, 10)           68920     
_________________________________________________________________
lstm_4 (LSTM)                (None, 100)               44400     
_________________________________________________________________
dropout_4 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 6892)              696092    
Total params: 809,412
Trainable params: 809,412
Non-trainable params: 0
_________________________________________________________________
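
As a sanity check on the summary above, the parameter counts can be reproduced by hand. The sketch uses the usual formulas: an embedding is one vector per vocabulary entry, an LSTM has four gates that each carry input weights, recurrent weights, and a bias, and a dense layer has a weight matrix plus a bias.

```python
vocab_size = 6892   # total_words
embed_dim = 10      # second argument to Embedding
lstm_units = 100    # argument to LSTM

embedding_params = vocab_size * embed_dim
# four gates, each with (input + recurrent) weights and one bias per unit
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
dense_params = lstm_units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 68920 44400 696092 809412
```

Most of the parameters sit in the output layer, which is another reason to keep the vocabulary small.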


In [36]:
checkpoint = ModelCheckpoint("/content/drive/MyDrive/Listings/weights/best_weights", monitor='loss', verbose=1,
    save_best_only=True, mode='auto', save_freq=1000)
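
A note on `save_freq=1000`: as far as I understand the Keras callback, an integer `save_freq` counts batches rather than epochs, and the counter keeps running across epoch boundaries. The sketch below is plain Python, not Keras code. The helper name `checkpoint_schedule` is made up, and the 4247 steps per epoch come from the progress bar in the training output. It works out at which batch of each epoch a checkpoint check should fire.

```python
def checkpoint_schedule(steps_per_epoch, epochs, save_freq):
    """Map each epoch (1-based) to the batch numbers where a checkpoint check fires."""
    schedule = {epoch: [] for epoch in range(1, epochs + 1)}
    total_batches = steps_per_epoch * epochs
    for global_batch in range(save_freq, total_batches + 1, save_freq):
        epoch = (global_batch - 1) // steps_per_epoch + 1
        batch_in_epoch = (global_batch - 1) % steps_per_epoch + 1
        schedule[epoch].append(batch_in_epoch)
    return schedule

schedule = checkpoint_schedule(steps_per_epoch=4247, epochs=10, save_freq=1000)
for epoch in (1, 2, 5):
    print(epoch, schedule[epoch])
# 1 [1000, 2000, 3000, 4000]
# 2 [753, 1753, 2753, 3753]
# 5 [12, 1012, 2012, 3012, 4012]
```

This lines up with the training log: four checkpoint messages in most epochs, five in epochs 5 and 9, and progress-bar snapshots such as 752/4247 in epoch 2 and 11/4247 in epoch 5 that fall within one step of the predicted batch. It also explains why several messages share the same "Epoch 0000N" label.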

In [37]:
#training the model
model.fit(predictors, label, epochs=10, callbacks=[checkpoint])



Epoch 1/10
Epoch 00001: loss improved from inf to 5.90839, saving model to /content/drive/MyDrive/Listings/weights/best_weights
INFO:tensorflow:Assets written to: /content/drive/MyDrive/Listings/weights/best_weights/assets
Epoch 00001: loss improved from 5.90839 to 5.84329, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00001: loss improved from 5.84329 to 5.78936, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00001: loss improved from 5.78936 to 5.74569, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 2/10
 752/4247 [====>.........................] - ETA: 2:12 - loss: 5.3902
Epoch 00002: loss improved from 5.74569 to 5.34041, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00002: loss improved from 5.34041 to 5.30896, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00002: loss improved from 5.30896 to 5.28401, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00002: loss improved from 5.28401 to 5.26390, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 3/10
 505/4247 [==>...........................] - ETA: 2:22 - loss: 4.9818
Epoch 00003: loss improved from 5.26390 to 4.96550, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00003: loss did not improve from 4.96550
Epoch 00003: loss improved from 4.96550 to 4.95574, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00003: loss improved from 4.95574 to 4.94321, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 4/10
 259/4247 [>.............................] - ETA: 2:31 - loss: 4.7155
Epoch 00004: loss improved from 4.94321 to 4.70982, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00004: loss improved from 4.70982 to 4.68123, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00004: loss did not improve from 4.68123
Epoch 00004: loss did not improve from 4.68123
Epoch 5/10
  11/4247 [..............................] - ETA: 2:40 - loss: 4.6093
Epoch 00005: loss improved from 4.68123 to 4.54323, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00005: loss improved from 4.54323 to 4.47419, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00005: loss did not improve from 4.47419
Epoch 00005: loss did not improve from 4.47419
Epoch 00005: loss did not improve from 4.47419
Epoch 6/10
 765/4247 [====>.........................] - ETA: 2:12 - loss: 4.2785
Epoch 00006: loss improved from 4.47419 to 4.29703, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00006: loss did not improve from 4.29703
Epoch 00006: loss did not improve from 4.29703
Epoch 00006: loss did not improve from 4.29703
Epoch 7/10
 518/4247 [==>...........................] - ETA: 2:20 - loss: 4.1366
Epoch 00007: loss improved from 4.29703 to 4.14345, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00007: loss did not improve from 4.14345
Epoch 00007: loss did not improve from 4.14345
Epoch 00007: loss did not improve from 4.14345
Epoch 8/10
 271/4247 [>.............................] - ETA: 2:29 - loss: 4.0402
Epoch 00008: loss improved from 4.14345 to 4.02836, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00008: loss did not improve from 4.02836
Epoch 00008: loss did not improve from 4.02836
Epoch 00008: loss did not improve from 4.02836
Epoch 9/10
  23/4247 [..............................] - ETA: 2:39 - loss: 3.8376
Epoch 00009: loss improved from 4.02836 to 3.82431, saving model to /content/drive/MyDrive/Listings/weights/best_weights
Epoch 00009: loss did not improve from 3.82431
Epoch 00009: loss did not improve from 3.82431
Epoch 00009: loss did not improve from 3.82431
Epoch 00009: loss did not improve from 3.82431
Epoch 10/10
 777/4247 [====>.........................] - ETA: 2:11 - loss: 3.8296
Epoch 00010: loss did not improve from 3.82431
Epoch 00010: loss did not improve from 3.82431
Epoch 00010: loss did not improve from 3.82431
Epoch 00010: loss did not improve from 3.82431


<keras.callbacks.History at 0x7f9e3d2ef590>

There's a warning here each time the model saves:
`WARNING:absl:Found untraced functions such as lstm_cell_3_layer_call_fn, lstm_cell_3_layer_call_and_return_conditional_losses, lstm_cell_3_layer_call_fn, lstm_cell_3_layer_call_and_return_conditional_losses, lstm_cell_3_layer_call_and_return_conditional_losses while saving (showing 5 of 5). These functions will not be directly callable after loading.`

I can't figure out exactly what this is, but there's a thread about it here: https://github.com/tensorflow/tensorflow/issues/47554

It seems like an issue that needs to be fixed in the TensorFlow source code, so I'll keep an eye on it. In the meantime, the model still works despite the warning.


In [40]:
#loading weights
checkpoint_path = "/content/drive/MyDrive/Listings/weights/best_weights"
model.load_weights(checkpoint_path)

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7f9e3e3cba10>

In [41]:
#generating the text
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes was removed in newer TensorFlow releases; taking the
        # argmax of the softmax output makes the same greedy choice
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])

        # map the predicted index back to its word ("" if the index is unknown)
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " "+output_word
    return seed_text

In [42]:
generate_text("Welcome to your new",100,model,100)





'Welcome to your new home this home is a great location in the heart of downtown stamford and the home has been lovingly maintained and has a large living room with fireplace and a large kitchen with a breakfast bar and a breakfast bar and a breakfast bar and a breakfast bar and a large master bedroom with a full bath with a walk in closet and a full bath with a walk in closet and a full bath with a walk in closet and a full bath with a walk in closet and a full bath with a walk in closet and a'
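
The output gets stuck in loops ("and a breakfast bar and a breakfast bar") because `generate_text` always picks the single most likely next word. One common alternative, which this notebook does not use, is to sample from the predicted distribution with a temperature. The sketch below is plain Python with a made-up three-word distribution. The function name `sample_with_temperature` is invented here, and I haven't tried it on this model yet.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after rescaling the distribution.

    A temperature near 0 approaches the greedy argmax choice; 1.0 samples
    from the distribution as predicted; higher values flatten it.
    """
    # Work in log space; the floor avoids log(0) for zero-probability words
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]  # subtract peak for stability
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

probs = [0.1, 0.7, 0.2]   # hypothetical softmax output over a 3-word vocabulary
rng = random.Random(0)    # fixed seed so the draws are repeatable
print([sample_with_temperature(probs, 0.01, rng) for _ in range(5)])  # near-greedy
print([sample_with_temperature(probs, 1.0, rng) for _ in range(5)])   # more varied
```

To try it, the argmax step in `generate_text` would be swapped for a call to this function on the model's predicted probabilities for the next word.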