<a href="https://colab.research.google.com/github/cspadine/ListingGenerator/blob/main/ListingGeneratorPart1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Listing Generator, Version 1

This is the first step in an ongoing process; the eventual goal is to create a model that will generate a description of a home that can be used in the listing to sell that home.

As a first step, I'm following the tutorial here (https://www.kaggle.com/shivamb/beginners-guide-to-text-generation-using-lstms) to build a basic word-level LSTM network that generates text based on a small sample of listings.  Subsequent work will build on this toy model to create the actual model.

Future improvements (listed here so I can refer back to them later, in no particular order):

1.   Add a data generator for the model. This tutorial will be useful: https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
2.   Experiment with cleaning the data in different ways. I don't want individual phone numbers, for example, treated as tokens, because that's going to increase the number of features by a lot.
3. Relatedly, use a word2vec model for vectorization.  
4. Also relatedly, think more about how to handle certain named entities, for example specific bodies of water referenced in a listing. In "This house sits on the shore of Lake Whatever", keeping "Lake Whatever" probably isn't going to generalize well.
5. Incorporate other features from the listings, like number of bedrooms, whether or not the property has a pool, etc.  My plan is to feed these features in along with the word vectors and see if that helps direct the output to be more reflective of a specific property.  However, in order to gauge how well this works, I'll need relatively cogent text coming out of the model.  If the output is gibberish, it is hard to evaluate whether it more accurately describes, say, a multi-family Victorian-style house or a one-story bungalow.



In [None]:
!pip install tensorflow
!pip install keras



Importing the packages you need for the model:

In [None]:
# keras module for building LSTM 
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from tensorflow.keras.utils import to_categorical
import keras.utils as ku 

# set seeds for reproducibility (TF 2.x API; uncomment to enable)
#import tensorflow as tf
#from numpy.random import seed
#tf.random.set_seed(2)
#seed(1)

import pandas as pd
import numpy as np
import string, os 

import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

###Reading in the text

In [None]:
# Kept for reference: the tutorial's loader for its headline data.
# The tutorial reads in all the files and builds one long list.
# It also only has a small amount of data.
# Eventually, I'll want a pipeline that draws a few examples at
# a time and vectorizes them.
curr_dir = '../input/'
all_headlines = []
for filename in os.listdir(curr_dir):
    if 'Articles' in filename:
        article_df = pd.read_csv(curr_dir + filename)
        all_headlines.extend(list(article_df.headline.values))
        break

all_headlines = [h for h in all_headlines if h != "Unknown"]
len(all_headlines)

FileNotFoundError: ignored

In [None]:
with open("/content/drive/MyDrive/Listings/sample_text7.txt","r") as file:
  text = file.read()

listings = text.split("\n")

In [None]:
for n in range(0,10):
  print(listings[n])

Investors take notice! Great investment!!! 3 family property with a 2 family, side by side, 3 bedrooms each unit and 1277 sqft per unit and a single family home with 4 bedrooms and 1224 sqft. Gas heat, newer boilers. Electrical has been updated.  Income $2900 per month. Needing  TLC. Will not pass FHA.  Taxes will be +-6000 when the property is divided
Welcome to Silvermine.  Conveniently located; although situated on a private lane.  Beautifully manicured and landscaped level lot.   The updated kitchen with propane cooking is ready for gatherings.  Recently paved driveway and new siding.  During the warmer months, enjoy the spacious deck, large yard for entertaining and during the cooler months retreat to the warmth of the fireplace.  A finished upper level that features a half bath and laundry, easily convert to a master suite.  A Must See!     All new windows & screens   New vinyl siding with added insulation  Insulation between walls & in crawl space  Brand new deck (2019)  Newly p

###Cleaning the text

The original data for the tutorial this is based on is newspaper headlines. For that data, it makes sense to strip out punctuation. For listings, I want to keep it.

In [None]:
# The tutorial's function for cleaning text seems pretty good, but we do want to keep punctuation
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

#corpus = [clean_text(x) for x in all_headlines]
#corpus[:10]

There are a few things I'm going to want to remove here.  The biggest one is phone numbers; I'll come back to that.

Also for the future: remove names of entities? For example, "west hartford", since we only want those if the place is actually in West Hartford.
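Here is a minimal sketch of the phone-number cleanup, assuming US-style ten-digit numbers. The pattern, the `phonenumber` placeholder, and the helper name `mask_phone_numbers` are my own assumptions, not something from the tutorial. The idea is to collapse every number into one shared token so the vocabulary doesn't grow with each new listing. The placeholder is a plain word so that it survives any later punctuation filtering.

```python
import re

# Hypothetical placeholder token; a plain word so punctuation filters leave it alone.
PHONE_TOKEN = "phonenumber"

# Assumed format: optional +1 country code, 3-digit area code (optionally in
# parentheses), then 3 and 4 digits, separated by spaces, dots, or hyphens.
PHONE_RE = re.compile(
    r"""
    (?<!\d)              # not preceded by another digit
    (?:\+?1[\s.-]?)?     # optional country code
    (?:\(\d{3}\)|\d{3})  # area code, with or without parentheses
    [\s.-]?
    \d{3}
    [\s.-]?
    \d{4}
    (?!\d)               # not followed by another digit
    """,
    re.VERBOSE,
)

def mask_phone_numbers(txt):
    """Replace anything that looks like a US phone number with PHONE_TOKEN."""
    return PHONE_RE.sub(PHONE_TOKEN, txt)

print(mask_phone_numbers("Call 203-555-0123 today!"))
print(mask_phone_numbers("1277 sqft, income $2900 per month"))
```

Shorter numbers such as square footage and prices are left alone because the pattern requires ten digits. Other formats (extensions, international numbers) would need more patterns.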

In [None]:

def clean_text(txt):
    # Lowercase, but keep punctuation (unlike the tutorial's version)
    txt = txt.lower()
    # Drop any non-ASCII characters
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt 

corpus = [clean_text(x) for x in listings]
corpus

['investors take notice! great investment!!! 3 family property with a 2 family, side by side, 3 bedrooms each unit and 1277 sqft per unit and a single family home with 4 bedrooms and 1224 sqft. gas heat, newer boilers. electrical has been updated.  income $2900 per month. needing  tlc. will not pass fha.  taxes will be +-6000 when the property is divided',
 'welcome to silvermine.  conveniently located; although situated on a private lane.  beautifully manicured and landscaped level lot.   the updated kitchen with propane cooking is ready for gatherings.  recently paved driveway and new siding.  during the warmer months, enjoy the spacious deck, large yard for entertaining and during the cooler months retreat to the warmth of the fireplace.  a finished upper level that features a half bath and laundry, easily convert to a master suite.  a must see!     all new windows & screens  new vinyl siding with added insulation  insulation between walls & in crawl space  brand new deck (2019)  ne

###Tokenization

For now, I'll use the Keras Tokenizer.  Eventually I'll want some kind of word2vec model.
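One caveat: by default the Keras `Tokenizer` lowercases text and filters out most punctuation, so the punctuation that `clean_text` preserves is dropped again at this step. As a rough sketch of one alternative, punctuation marks can be split off as separate tokens before building the vocabulary. The regex and the helper name `split_keep_punct` are my own assumptions, not from the tutorial.

```python
import re

# Words (with an optional internal apostrophe, e.g. "don't") or single
# punctuation characters; whitespace is discarded.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def split_keep_punct(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

print(split_keep_punct("Investors take notice! Great investment!!!"))
```

Joining these tokens with spaces and passing `filters=''` to `Tokenizer` should then keep the punctuation as vocabulary items, though I haven't tried that on the listings yet.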

In [None]:
tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to sequence of tokens 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
inp_sequences[:10]

[[654, 328],
 [654, 328, 771],
 [654, 328, 771, 50],
 [654, 328, 771, 50, 470],
 [654, 328, 771, 50, 470, 33],
 [654, 328, 771, 50, 470, 33, 32],
 [654, 328, 771, 50, 470, 33, 32, 58],
 [654, 328, 771, 50, 470, 33, 32, 58, 4],
 [654, 328, 771, 50, 470, 33, 32, 58, 4, 3],
 [654, 328, 771, 50, 470, 33, 32, 58, 4, 3, 17]]
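
To make the output above concrete: each listing with n tokens is expanded into n-1 growing prefixes, and later the last id of each prefix becomes the label to predict. Here is a toy illustration with a hand-built vocabulary. The helpers `build_vocab` and `prefix_sequences` are made up for illustration and are not the Keras `Tokenizer`, which assigns ids by word frequency rather than by order of first appearance.

```python
def build_vocab(corpus):
    """Assign ids starting at 1 in order of first appearance; 0 is left for padding."""
    vocab = {}
    for line in corpus:
        for word in line.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def prefix_sequences(token_ids):
    """Return every prefix of length >= 2, mirroring the loop in get_sequence_of_tokens."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

toy_corpus = ["welcome to your new home"]
vocab = build_vocab(toy_corpus)
ids = [vocab[w] for w in toy_corpus[0].split()]

for seq in prefix_sequences(ids):
    print(seq)
# [1, 2]
# [1, 2, 3]
# [1, 2, 3, 4]
# [1, 2, 3, 4, 5]
```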

###Padding sentences

In [None]:
#Padding sentences
def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
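
The padding and the predictor/label split can be illustrated without Keras. The sketch below also yields batches lazily, which is roughly the data generator idea listed at the top: `to_categorical` above builds a one-hot row of `total_words` entries for every sequence at once, whereas a generator only needs to hold one batch at a time. The helpers `pad_pre` and `batch_generator` are hypothetical names of mine, and the labels are left as integer ids rather than one-hot vectors.

```python
def pad_pre(seq, maxlen, pad_id=0):
    """Left-pad (or left-truncate) a list of ids to exactly maxlen entries."""
    seq = seq[-maxlen:]
    return [pad_id] * (maxlen - len(seq)) + seq

def batch_generator(sequences, maxlen, batch_size):
    """Yield (predictors, labels) batches; the last id of each padded row is the label."""
    for start in range(0, len(sequences), batch_size):
        batch = [pad_pre(s, maxlen) for s in sequences[start:start + batch_size]]
        predictors = [row[:-1] for row in batch]
        labels = [row[-1] for row in batch]
        yield predictors, labels

sequences = [[654, 328], [654, 328, 771], [654, 328, 771, 50]]
for predictors, labels in batch_generator(sequences, maxlen=4, batch_size=2):
    print(predictors, labels)
# [[0, 0, 654], [0, 654, 328]] [328, 771]
# [[654, 328, 771]] [50]
```

With integer labels like these, Keras's `sparse_categorical_crossentropy` loss would be the matching choice instead of `categorical_crossentropy`.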

###Defining the model

One thing I'm noticing is that this model takes a long time to train, but isn't saving the weights anywhere.  We're going to want to save those weights to use later, or in case anything goes wrong. 
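
A minimal sketch of how the weights could be saved during training, using the `ModelCheckpoint` callback. The file name, the temporary directory, and the tiny stand-in model are assumptions for illustration only; the real call would pass the callback to the `model.fit` call on the listing model. The `.weights.h5` suffix is what recent Keras versions expect when only weights are saved.

```python
import os
import tempfile

import numpy as np
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

# Hypothetical location; in Colab this could be a folder on the mounted Drive.
checkpoint_dir = tempfile.mkdtemp()
weights_path = os.path.join(checkpoint_dir, "listing_lstm.weights.h5")

# Write the weights to disk at the end of every epoch.
checkpoint = ModelCheckpoint(filepath=weights_path, save_weights_only=True)

# Tiny stand-in model and random data, only to show the callback in action.
toy_model = Sequential([Input(shape=(4,)), Dense(3, activation="softmax")])
toy_model.compile(loss="categorical_crossentropy", optimizer="adam")
x = np.random.rand(8, 4)
y = np.eye(3)[np.random.randint(0, 3, size=8)]
toy_model.fit(x, y, epochs=2, verbose=0, callbacks=[checkpoint])

# Later (or after a crash): rebuild the same architecture and restore the weights.
toy_model.load_weights(weights_path)
```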

In [None]:
#defining the keras model
def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    
    # Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    
    # Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    
    # Add Output Layer
    model.add(Dense(total_words, activation='softmax'))

    model.compile(loss='categorical_crossentropy', optimizer='adam')
    
    return model

model = create_model(max_sequence_len, total_words)
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 813, 10)           68920     
_________________________________________________________________
lstm (LSTM)                  (None, 100)               44400     
_________________________________________________________________
dropout (Dropout)            (None, 100)               0         
_________________________________________________________________
dense (Dense)                (None, 6892)              696092    
Total params: 809,412
Trainable params: 809,412
Non-trainable params: 0
_________________________________________________________________
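
As a sanity check on the summary above, the parameter counts can be reproduced by hand from the layer sizes (a vocabulary of 6892, embedding size 10, 100 LSTM units). The helper names are mine; the formulas are the standard ones for these layer types.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embed_dim, lstm_units = 6892, 10, 100
counts = [
    embedding_params(vocab_size, embed_dim),
    lstm_params(embed_dim, lstm_units),
    dense_params(lstm_units, vocab_size),
]
print(counts, sum(counts))
# [68920, 44400, 696092] 809412
```

Most of the parameters sit in the output layer, which grows with the vocabulary. That is one more reason to keep things like phone numbers out of the vocabulary.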


In [None]:
#training the model
model.fit(predictors, label, epochs=200, verbose=5)

NameError: ignored

In [None]:
#generating the text
def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes was removed from newer versions of Keras;
        # take the argmax of the predicted probabilities instead
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        
        # look up the word for the predicted index (index 0 is padding)
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " "+output_word
    return seed_text

In [None]:
generate_text("Welcome to your new",100,model,100)

'Welcome to your new home this charming home is located in a quiet cul de sac in the heart of black rock and just steps away from the hustle and bustle pool and hot water heater and well maintained and ready for a summer or winter or enjoy the peace of the property is a great place to live in the beach and you are the home is a stunning home with a dramatic foyer and a family room with a fireplace and a spacious dining room with fireplace and a large living room with fireplace and built ins and a large windows and'