**Alison Glazer**
# Airbnb Pricing - Natural Language Processing
We will look at the topics discussed in the names and descriptions of each Airbnb listing to see if that affects the prices set by the hosts. We will first focus primarily on the discussion of the style and design of each listing

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-Libraries" data-toc-modified-id="Import-Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import Libraries</a></span></li><li><span><a href="#Display-Options" data-toc-modified-id="Display-Options-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Display Options</a></span></li><li><span><a href="#Load-the-Data" data-toc-modified-id="Load-the-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load the Data</a></span></li><li><span><a href="#Analyze-the-Text" data-toc-modified-id="Analyze-the-Text-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Analyze the Text</a></span><ul class="toc-item"><li><span><a href="#Helper-Functions-for-Preprocessing" data-toc-modified-id="Helper-Functions-for-Preprocessing-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Helper Functions for Preprocessing</a></span></li><li><span><a href="#Preprocess" data-toc-modified-id="Preprocess-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Preprocess</a></span></li></ul></li><li><span><a href="#Topic-Modeling" data-toc-modified-id="Topic-Modeling-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Topic Modeling</a></span><ul class="toc-item"><li><span><a href="#LSA" data-toc-modified-id="LSA-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>LSA</a></span></li><li><span><a href="#NMF" data-toc-modified-id="NMF-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>NMF</a></span></li><li><span><a href="#LDA" data-toc-modified-id="LDA-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>LDA</a></span></li></ul></li></ul></div>

## Import Libraries

In [1]:
# Data
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Text
import re
import string

import nltk
from nltk.tokenize import word_tokenize
# from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tag import pos_tag

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
from nltk.tokenize import MWETokenizer

from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import NMF
from sklearn.decomposition import LatentDirichletAllocation

# PCA
from sklearn.decomposition import PCA

# Saving
import pickle

## Display Options

In [2]:
# Colors sourced from here: https://usbrandcolors.com/airbnb-colors/
bnb_red = '#FF5A5F'
bnb_blue = '#00A699'
bnb_orange = '#FC642D'
bnb_lgrey = '#767676'
bnb_dgrey = '#484848'
bnb_maroon = '#92174D'

## Load the Data

In [3]:
# Load Seattle Listings Features
with open('data/lax_text_feat.pickle', 'rb') as to_read:
    text_feat = pickle.load(to_read)

## Analyze the Text

### Helper Functions for Preprocessing

In [4]:
def preprocess(docs):
    """
    Preprocess a corpus (Series) of documents before using a vectorizer
    - remove numbers and punctuation
    - remove urls
    - convert all text to lower case
    """
    # Remove numbers and punctuation
    alphanumeric = lambda x: re.sub('\w*\d\w*', ' ', x)
    
    # Convert all text to lowercase
    punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
    
    # remove non alpha characters
    non_alpha = lambda x: re.sub('[^a-zA-Z]', ' ', x)
    
#     # Remove all image links
#     image_link = lambda x: re.sub('http.*?¦',' ', x)
    
    return docs.map(alphanumeric).map(punc_lower).map(non_alpha)#.map(image_link)

In [77]:
stemmer = nltk.SnowballStemmer("english", ignore_stopwords=True)


def stem_tokens(tokens, stemmer):
    '''
    Stem word tokens using the English SnowballStemmer
    '''
    stemmed = []
    for token in tokens:
        stemmed.append(stemmer.stem(token))
    stemmed = [w for w in stemmed if len(w) > 3]
    return stemmed


# Stop words
stop_words_pre = [
    'room', 'bedroom', 'home', 'location', 'live', 'guest', 'house', 'queen',
    'neighborhood', 'floor', 'place', 'minute', 'block', 'stay', 'area',
    'city', 'away', 'unit', 'include', 'available', 'just', 'need', 'apart',
    'local', 'shop', 'downtown', 'space', 'like', 'build', 'rail', 'washer',
    'dryer', 'microwave', 'stock', 'universe', 'walk', 'enjoy', 'union',
    'olympic', 'shared', 'size', 'street', 'access', 'close', 'free', 'bath',
    'note', 'sugar', 'creamer', 'iron', 'earplug', 'fridge', 'kitchenette',
    'laptop', 'appliance', 'airport', 'suite', 'park', 'mile', 'recycle',
    'compost', 'lincoln', 'leed', 'platinum', 'ferry', 'ferries', 'does',
    'wifi', 'check', 'condo', 'make', 'hill', 'bathroom', 'kitchen', 'grocery',
    'studio', 'coffee', 'puget', 'condominium', 'salt', 'pepper', 'bathroom',
    'shower', 'lake', 'heart', 'center', 'offer', 'dine', 'dining', 'provide',
    'high', 'fully', 'door', 'welcome', 'features', 'feature', 'feel', 'main',
    'line', 'window', 'level', 'laundry', 'universal', 'right', 'university',
    'link', 'private', 'small', 'large', 'restaurants', 'located', 'living',
    'apartment', 'great', 'guests', 'minutes', 'blocks', 'perfect', 'building',
    'bedrooms', 'stocked', 'amenities', 'basement', 'near', 'district',
    'airbnb', 'short', 'plenty', 'sofa', 'includes', 'equipped', 'closet',
    'anne', 'separate', 'water', 'sized', 'best', 'mattress', 'love',
    'station', 'time', 'table', 'west', 'amazon', 'people', 'entire', 'couch',
    'towels', 'yard', 'upstairs', 'provided', 'miles', 'ride', 'parks',
    'travelers', 'lower', 'steps', 'loft', 'green', 'central', 'market',
    'rooftop', 'including', 'windows', 'floors', 'drive', 'lots', 'beds',
    'nearby', 'brand', 'easy', 'entrance', 'phone', 'public', 'questions',
    'night', 'appliances', 'want', 'text', 'south', 'stores', 'offers',
    'square', 'help', 'speed', 'walls', 'brick', 'site', 'apartments', 'super',
    'explore', 'composting', 'futon', 'essentials', 'double', 'elevator',
    'master', 'king', 'furnished', 'cottage', 'craftsman', 'venues', 'radius',
    'town', 'sound', 'pioneer', 'internet', 'expect', 'townhouse', 'sink',
    'shampoo', 'linens', 'conditioner', 'mini', 'original', 'coffee_maker',
    'property', 'broadcast', 'host', 'site', 'cable', 'come', 'north', 'entry',
    'extra', 'lines', 'long', 'neighborhoods', 'second', 'convention', 'body',
    'wash', 'artist', 'press', 'french', 'wake', 'columbia', 'base', 'filled',
    'community', 'waterfront', 'looking', 'spot', 'know', 'little', 'rental',
    'furniture', 'term', 'foam', 'memory', 'oven', 'plus', 'baker', 'stadiums',
    'quick', 'excited', 'mountains', 'good', 'cooking', 'surrounded', 'north',
    'electric', 'rainer', 'junction', 'trail', 'taxi', 'speed', 'walls',
    'brick', 'site', 'apartments', 'super', 'explore', 'composting', 'futon',
    'essentials', 'double', 'elevator', 'master', 'king', 'furnished',
    'cottage', 'craftsman', 'venues', 'radius', 'town', 'sound', 'pioneer',
    'internet', 'expect', 'townhouse', 'sink', 'shampoo', 'linens',
    'conditioner', 'mini', 'belltown', 'douglas', 'original', 'coffee_maker',
    'property', 'broadcast', 'cottage', 'host', 'site', 'cable', 'come',
    'north', 'entry', 'extra', 'lines', 'long', 'neighborhoods', 'second',
    'convention', 'body', 'wash', 'artist', 'press', 'french', 'wake',
    'columbia', 'base', 'filled', 'community', 'waterfront', 'looking', 'spot',
    'know', 'little', 'rental', 'furniture', 'term', 'foam', 'memory', 'oven',
    'plus', 'baker', 'stadiums', 'quick', 'excited', 'mountains', 'good',
    'cooking', 'surrounded', 'north', 'electric', 'rainer', 'junction',
    'trail', 'taxi', 'refrigerator', 'youtube', 'zipcar', 'zoka', 'zeus',
    'desk', 'hair', 'train', 'loop', 'international', 'train', 'issue',
    'friendly', 'efficient', 'support', 'hosts', 'enter', 'bathrooms',
    'townhome', 'ground', 'conditioning', 'feeling', 'mins', 'ceilings',
    'built', 'othello', 'experience', 'fresh', 'elliott', 'flat', 'guide',
    'things', 'managed', 'locally', 'selection', 'madison', 'cafes', 'plan',
    'doors', 'concept', 'sleeps', 'wheel', 'sunset', 'common', 'nice', 'food',
    'having', 'begin', 'centrally', 'flat', 'screen', 'read', 'madrona',
    'lock', 'common', 'guide', 'commute', 'hospitals', 'wide', 'safeco',
    'foot', 'museum', 'stairs', 'hospital', 'burke', 'gilman', 'stop',
    'absolutely', 'week', 'steel', 'wood', 'stainless', 'oxford', 'self',
    'dedicated', 'beans', 'roasted', 'rooms', 'dishwasher', 'stove',
    'mountain', 'skyline', 'breakfast', 'hardwood', 'major', 'areas',
    'breweries', 'buses', 'mall', 'routes', 'tree', 'bedding', 'smart', 'bike',
    'mind', 'cotton', 'golden', 'watch', 'lotion', 'grinder', 'amenity',
    'summer', 'glass', 'provides', 'story', 'needed', 'covered', 'chairs',
    'forward', 'reading', 'pillows', 'sheets', 'storage', 'counter', 'organic',
    'board', 'utensils', 'arrival', 'ridge', 'score', 'walking', 'message',
    'goes', 'days', 'book', 'listing', 'booking', 'quite', 'needs',
    'greenwood', 'fitness', 'indoor', 'field', 'century', 'hollywood',
    'kinney', 'santa', 'monica', 'los', 'angeles', 'downtown', 'hills',
    'avenue', 'ocean', 'beverly', 'manhattan', 'hermosa', 'redondo',
    'wilshire', 'universal', 'airport', 'culver', 'venice', 'marina', 'canals',
    'abbott', 'boardwalk', 'blvd', 'metro', 'dtla', 'canyon', 'malibu',
    'studios', 'bungalow', 'pier', 'pacific', 'belmont', 'california', 'wine',
    'rose', 'maker', 'number', 'twin', 'hidden', 'staples', 'female',
    'koreatown', 'burbank', 'valley', 'warner', 'brothers', 'noho', 'arts',
    'feliz', 'glendale', 'echo', 'silver', 'silverlake', 'highland',
    'griffith', 'observatory', 'uber', 'grove', 'melrose', 'included',
    'allowed', 'forum', 'ucla', 'westwood', 'village', 'brentwood', 'college',
    'stadium', 'hall', 'court', 'tennis', 'pasadena', 'freeway', 'rose',
    'topanga', 'sign', 'walt', 'mary', 'abbot', 'chinese', 'distance',
    'promenade', 'world', 'spots', 'trader', 'joes', 'berry', 'bikes',
    'ralphs', 'washington', 'playa', 'muscle', 'spanish', 'attractions',
    'grill', 'https', 'means', 'americana', 'mattresses', 'concert', 'parking',
    'hotel', 'reviews', 'stays', 'month', 'listings', 'years', 'minimum',
    'website', 'kettle', 'possible', 'flags', 'magic', 'person', 'important',
    'additional', 'accommodate', 'accommodates', 'starbucks', 'store', 'bars',
    'shops', 'disneyland', 'soap', 'dishes', 'huge', 'famous', 'shopping',
    'shops', 'disney', 'total', 'using', 'inch', 'target', 'reservation',
    'premises', 'freeways', 'toaster', 'pots', 'pans', 'complimentary',
    'garage', 'aquarium', 'sand', 'shore', 'surf', 'path', 'boards'
    'fame'
]
stop_words = stop_words_pre
# stop_words = stem_tokens(
#     list(preprocess(pd.Series(stop_words_pre))),
#     nltk.SnowballStemmer("english", ignore_stopwords=True))

In [78]:
list_pos_filter = ['JJ','JJR','NN','NNS','RB','VB','VBD','VBG','VBN','VBP','VBZ']

def tokenize_and_pos(text):
    '''
    Tokenize and only keep words in the specified list of parts of speech
    '''
    tokens = [x[0] for x in pos_tag(word_tokenize(text)) if x[1] in list_pos_filter]

#     mwe_tokenizer = MWETokenizer(
#         mwes=[('walking',
#                'distance'), ('fully', 'furnished'), (
#                    'private',
#                    'entrance'), ('coffee', 'maker'), ('pike', 'place',
#                                                       'market')])
#     tokens = mwe_tokenizer.tokenize(tokens)

    #Alphabetical tokens only with word length greater than 3
    tokens = [w for w in tokens if len(w) > 3]
#     stems = stem_tokens(tokens, stemmer)
    return tokens


# Stop words
# my_stop_words = stem_tokens(text.ENGLISH_STOP_WORDS.union(set(stop_words)),
#                             stemmer)
my_stop_words = text.ENGLISH_STOP_WORDS.union(set(stop_words))

def countvec(docs,
             tokenizer=tokenize_and_pos,
             ngram_range=(2, 3),
             stop_words=my_stop_words,
             min_df=10,
             max_df=0.9):
    """
    Generate document-term inputs for topic modeling with LSA and NMF(doc_term_mat) and LDA(corpus, id2word) using count-vectorizer
    ----
    Input: series of documents (strings)
    Output: Document-term matrix
    """
    count_vectorizer = CountVectorizer(tokenizer=tokenizer,
                                       ngram_range=ngram_range,
                                       stop_words=stop_words,
                                       min_df=min_df,
                                       max_df=max_df)
    count_vectorizer.fit(docs)

    # Create document-term matrix for use in LSA and NMF
    doc_term_mat = count_vectorizer.transform(docs)

    return count_vectorizer, doc_term_mat


def tfidfvec(docs,
             tokenizer=tokenize_and_pos,
             ngram_range=(2, 3),
             stop_words=my_stop_words,
             min_df=10,
             max_df=0.9):
    """
    Generate document-term inputs for topic modeling with LSA and NMF(doc_term_mat) and LDA(corpus, id2word) using tf-idf-vectorizer
    ----
    Input: series of documents (strings)
    Output: Document-term matrix
    """
    tf_vectorizer = TfidfVectorizer(tokenizer=tokenizer,
                                    ngram_range=ngram_range,
                                    stop_words=stop_words,
                                    min_df=min_df,
                                    max_df=max_df)
    tf_vectorizer.fit(docs)

    # Create document-term matrix for use in LSA and NMF
    doc_term_mat = tf_vectorizer.transform(docs)

    return tf_vectorizer, doc_term_mat

In [79]:
def display_topics(model, feature_names, no_top_words, topic_names=None):
    """
    Display topics and top associated words given a topic model
    """
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

### Preprocess

In [80]:
# Count Vectorizer
vectorizer_count, doc_term_mat_count = countvec(preprocess(text_feat),ngram_range=(1,1))

In [81]:
doc_term_mat_count.toarray().shape

(14522, 4402)

In [82]:
vocab = pd.DataFrame(doc_term_mat_count.toarray(), columns=vectorizer_count.get_feature_names())

In [83]:
vocab.head()

Unnamed: 0,abbey,ability,able,abode,abound,absolute,abundance,abundant,academy,accent,...,yellow,yoga,yogurt,york,young,younger,yummy,zero,zone,zuma
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [84]:
# See which words are used the most
vocab.sum(axis=0).sort_values(ascending=False)[0:50]

beach             16573
quiet              8331
beautiful          7751
modern             6614
pool               6360
comfortable        6358
cozy               6353
spacious           6224
clean              5805
patio              5360
views              4287
netflix            4002
safe               3773
open               3675
view               3629
outdoor            3614
light              3413
newly              3373
family             3248
business           3167
amazing            3011
bright             2961
backyard           2954
luxury             2933
style              2863
garden             2826
balcony            2779
renovated          2663
remodeled          2589
relax              2482
work               2441
gated              2250
privacy            2227
comfy              2162
charming           2158
fame               1992
couples            1957
deck               1896
convenient         1874
sleep              1826
prime              1817
natural         

In [85]:
vocab.astype(bool).sum(axis=0).sort_values(ascending=False)[0:50]

beach             4150
quiet             3725
beautiful         3315
cozy              2965
comfortable       2878
modern            2691
spacious          2680
clean             2509
patio             2230
pool              2121
netflix           1904
safe              1862
open              1740
light             1620
outdoor           1617
view              1595
views             1578
business          1566
newly             1532
family            1510
amazing           1461
bright            1396
backyard          1346
style             1311
luxury            1284
work              1278
relax             1250
privacy           1237
renovated         1187
balcony           1180
remodeled         1122
charming          1077
comfy             1067
garden            1060
gated             1041
convenient         993
couples            977
transportation     945
sleep              891
smoking            889
peaceful           874
fame               869
relaxing           860
natural    

In [86]:
# # TF-IDF Vectorizer
# vectorizer_tf, doc_term_mat_tf= tfidfvec(preprocess(text_feat),ngram_range=(1,2))
# doc_term_mat_tf.toarray().shape

In [87]:
# vocab_tf = pd.DataFrame(doc_term_mat_tf.toarray(), columns=vectorizer_tf.get_feature_names())

## Topic Modeling

### LSA

In [88]:
lsa_c = TruncatedSVD(20)
doc_topic_lsa_c = lsa_c.fit_transform(doc_term_mat_count)
print('Explained Variance Ratio per topic:\n',lsa_c.explained_variance_ratio_)

topic_word = pd.DataFrame(lsa_c.components_.round(3),
             columns = vectorizer_count.get_feature_names())
display_topics(lsa_c, vectorizer_count.get_feature_names(), 10)

Explained Variance Ratio per topic:
 [0.03762693 0.02846406 0.01549927 0.01325538 0.01151103 0.01000239
 0.00933672 0.00901816 0.00851796 0.00812501 0.00787523 0.00769106
 0.00716445 0.00685841 0.00644821 0.00598066 0.00551982 0.00542835
 0.00527348 0.00505858]

Topic  0
beach, quiet, beautiful, modern, comfortable, cozy, spacious, patio, pool, clean

Topic  1
beach, shoreline, boards, boogie, pike, breezes, peninsula, silicon, alamitos, strand

Topic  2
pool, views, luxury, jacuzzi, view, balcony, beach, swimming, outdoor, modern

Topic  3
modern, spacious, netflix, luxury, views, newly, open, hulu, renovated, comfortable

Topic  4
beautiful, patio, garden, views, outdoor, deck, open, view, light, trees

Topic  5
beautiful, clean, comfortable, spacious, balcony, business, amazing, view, fame, cozy

Topic  6
quiet, modern, views, beautiful, safe, luxury, balcony, amazing, peaceful, beach

Topic  7
views, spacious, view, quiet, balcony, amazing, deck, comfortable, panoramic, breathtakin

In [97]:
# Save
pd.DataFrame(doc_topic_lsa_c).to_pickle('data/lax_lsa_feats.pickle')

In [89]:
# lsa_tf = TruncatedSVD(20)
# doc_topic_lsa_t = lsa_tf.fit_transform(doc_term_mat_tf)
# print('Explained Variance Ratio per topic:\n',lsa_tf.explained_variance_ratio_)

# topic_word = pd.DataFrame(lsa_tf.components_.round(3),
#              columns = vectorizer_tf.get_feature_names())
# display_topics(lsa_tf, vectorizer_tf.get_feature_names(), 30)

### NMF

In [90]:
nmf_c = NMF(20)
doc_topic_nmf_c = nmf_c.fit_transform(doc_term_mat_count)
# nmf_model.explained_variance_ratio_
topic_word = pd.DataFrame(nmf_c.components_.round(3),
             columns = vectorizer_count.get_feature_names())
display_topics(nmf_c, vectorizer_count.get_feature_names(), 10)


Topic  0
beach, amazing, vacation, sunny, bright, getaway, trendy, charming, breeze, yoga

Topic  1
spacious, bright, balcony, convenient, fabulous, makes, secure, spaces, plush, closets

Topic  2
pool, jacuzzi, swimming, heated, luxury, gated, complex, sauna, resort, security

Topic  3
modern, luxury, design, designed, chic, style, decor, gorgeous, furnishings, designer

Topic  4
beautiful, beaches, amazing, lovely, relax, conveniently, trees, style, deck, peaceful

Topic  5
patio, relax, morning, outside, charming, relaxing, work, lovely, driveway, completely

Topic  6
quiet, safe, peaceful, residential, backyard, hiking, share, transportation, convenient, gated

Topic  7
views, deck, amazing, gorgeous, panoramic, incredible, stunning, star, balcony, luxury

Topic  8
cozy, comfy, relax, work, charming, cute, sleep, comes, relaxing, style

Topic  9
netflix, hulu, prime, hdtv, luxury, roku, comfortably, streaming, gated, sleep

Topic  10
clean, safe, comfy, privacy, hostel, respect, s

In [95]:
# Save
pd.DataFrame(doc_topic_nmf_c).to_pickle('data/lax_nmf_feats.pickle')

In [96]:
doc_topic_nmf_c

array([[1.14993108e-03, 0.00000000e+00, 0.00000000e+00, ...,
        3.89339308e-02, 1.18815239e-02, 0.00000000e+00],
       [7.46803072e-01, 1.24239736e-01, 0.00000000e+00, ...,
        3.70403942e-04, 9.69600911e-03, 0.00000000e+00],
       [1.65527478e-01, 6.69432117e-03, 5.79780110e-03, ...,
        4.63385777e-03, 1.92926204e-01, 1.06593830e-02],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 9.68756591e-02],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.67330684e-03, 1.13313112e-02, 0.00000000e+00, ...,
        7.78122524e-04, 0.00000000e+00, 6.58660248e-02]])

In [91]:
# nmf_t = NMF(20)
# doc_topic_nmf_t = nmf_t.fit_transform(doc_term_mat_tf)
# # nmf_model_tf = doc_topic
# # nmf_model.explained_variance_ratio_
# topic_word = pd.DataFrame(nmf_t.components_.round(3),
#              columns = vectorizer_tf.get_feature_names())
# display_topics(nmf_t, vectorizer_tf.get_feature_names(), 30)

### LDA

In [92]:
lda_c = LatentDirichletAllocation(n_components=20,max_iter=50)
doc_topic_lda_c = lda_c.fit_transform(doc_term_mat_count)

In [93]:
topic_word = pd.DataFrame(lda_c.components_.round(3),
             columns = vectorizer_count.get_feature_names())
display_topics(lda_c, vectorizer_count.get_feature_names(), 10)


Topic  0
beautifully, heights, professional, decorated, sqft, sure, allows, utilities, theater, class

Topic  1
newly, remodeled, renovated, recently, clean, completely, security, privacy, gated, units

Topic  2
views, view, gorgeous, modern, luxury, deck, amazing, luxurious, beautiful, balcony

Topic  3
outdoor, open, patio, garden, light, modern, style, trees, relax, deck

Topic  4
pool, jacuzzi, balcony, luxury, swimming, beautiful, complex, view, gated, resort

Topic  5
netflix, prime, hulu, sleep, hdtv, comfortably, spacious, comfortable, heat, gated

Topic  6
beach, villa, course, balcony, golf, beaches, view, beautiful, vacation, deck

Topic  7
duplex, east, wall, historic, light, completely, relax, corner, natural, comfortably

Topic  8
clean, quiet, safe, cozy, comfortable, convenient, transportation, share, single, family

Topic  9
quiet, hiking, beautiful, smoking, trails, family, bowl, spacious, rock, patio

Topic  10
business, couples, solo, families, adventurers, kids, f

In [94]:
# Save
pd.DataFrame(doc_topic_lda_c).to_pickle('data/lax_lda_feats.pickle')

In [100]:
pd.DataFrame(doc_topic_lda_c).describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
count,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0,14522.0
mean,0.023277,0.04001,0.051795,0.094056,0.046635,0.045474,0.02996,0.039428,0.082174,0.046227,0.03904,0.049459,0.025906,0.090315,0.039648,0.048094,0.065303,0.050052,0.059387,0.033762
std,0.080585,0.091741,0.115713,0.167998,0.113733,0.118392,0.085672,0.094885,0.15847,0.108927,0.108255,0.109398,0.079835,0.152571,0.108922,0.113935,0.144224,0.104162,0.121789,0.089312
min,0.000327,0.000336,0.000336,0.000385,0.000327,0.000327,0.000336,0.000327,0.000327,0.000327,0.000327,0.000327,0.000327,0.000327,0.000327,0.000336,0.000327,0.000327,0.000345,0.000327
25%,0.000769,0.000833,0.000862,0.000943,0.000806,0.000833,0.000781,0.000806,0.000794,0.000794,0.00082,0.00082,0.000794,0.000877,0.000781,0.000847,0.000794,0.000877,0.000833,0.000769
50%,0.001163,0.001471,0.001471,0.001786,0.001282,0.001351,0.001163,0.001316,0.001471,0.00125,0.001316,0.001389,0.00119,0.001852,0.00122,0.001389,0.001316,0.001563,0.001563,0.00119
75%,0.003846,0.042475,0.05,0.132099,0.025,0.025,0.004167,0.016667,0.100399,0.016667,0.024395,0.05,0.004545,0.142296,0.0125,0.047421,0.05,0.050349,0.056677,0.008333
max,0.983036,0.98956,0.991204,0.988953,0.986429,0.987333,0.987333,0.982407,0.985606,0.954762,0.986986,0.984167,0.987333,0.968086,0.979787,0.989674,0.984921,0.981,0.962,0.970312


In [490]:
# lda_t = LatentDirichletAllocation(n_components=20,max_iter=50)
# doc_topic_lda_t = lda_t.fit_transform(doc_term_mat_tf)

In [491]:
# topic_word = pd.DataFrame(lda_t.components_.round(3),
#              columns = vectorizer_tf.get_feature_names())
# display_topics(lda_t, vectorizer_tf.get_feature_names(), 30)