# Using Natural Language Processing to Identify Latent Features

Airbnb hosts often provide rich and detailed descriptions of their homes on their listing pages in the hopes that they can sway a visitor to book their property.

Using Non-Negative Matrix Factorization (NMF), we will try to extract latent features from these host-written descriptions and add them as inputs to our regression model.

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.decomposition import NMF

import matplotlib.pyplot as plt
import json

import nltk
from nltk.tokenize import sent_tokenize, regexp
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize 

import src.airbnb_read_data_helper as read
import src.airbnb_NLP_helper as nlp


## Load detailed listing information

In [2]:
listing_details_raw_df = read.import_txtfile_as_df('data/airbnb_scraping/Seattle/Seattle_scraped_listing_info_20170317_COMBINED.txt')

In [3]:
listing_details_df = listing_details_raw_df.convert_objects(convert_numeric=True)

listing_details_df.drop('Save to Wish ListRoom TypeEntire home/aptProperty TypeAccommodates2Bedrooms$99 Barn 11', 
                        axis=1, inplace=True)



## Load in scraped metadata

In [4]:
scraped_meta_raw_df, scraped_meta_json = read.import_json_as_df('data/airbnb_scraping/Seattle/comb_temp2')

In [5]:
scraped_meta_price_df = read.convert_json_price_to_df(scraped_meta_json)
scraped_meta_desc_df = read.convert_json_desc_to_df(scraped_meta_json)
scraped_meta_amen_df = read.convert_json_amenities_to_df(scraped_meta_json)
scraped_meta_df = pd.concat([scraped_meta_price_df,scraped_meta_desc_df,scraped_meta_amen_df], axis=1)
scraped_meta_df = scraped_meta_df.reset_index()
scraped_meta_df = scraped_meta_df.rename(columns={'index':'prop_id'})

In [6]:
full_scraped_df = scraped_meta_df.merge(listing_details_df, how='inner', left_on="prop_id", right_on="hosting_id")

full_scraped_df.space = full_scraped_df.space.str.lower().str.strip().str.replace("\n"," ")
full_scraped_df = full_scraped_df[~full_scraped_df.space.isnull()]
full_scraped_df.space = full_scraped_df.space.str.encode("utf-8")

## Text Analysis

Building the right stopword dictionary helps us get more signal than noise when conducting natural language processing.

The list below is curated by hand. It combines very common English words with terms that appear in nearly every listing (such as "room", "home", and "seattle"), none of which we want to influence the tf or tf-idf calculations.

In [7]:
airbnb_stopwords = [
'and', 'the', 'to', 'a', 'in', 'of', 'with', 'is',  'on', 'you', 'this','our', 'has', 'are', 'for','your', 'out', 'there', 'will',
'can', 'be',  'but', 'its', 're','which','here', 'or',  'we', 'it',  'an','from','by','my', 'have', 'at', 'as', 'just',
'room','bedroom','bed', 'home','house','place','location', 'space', 'host', 
'seattle',
'solo', 'adventurers', 'business', 'travelers', 'couples', 
'youll', 'love', 'because', 'www', 'airbnb', 'com', 'https'
]
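One detail worth keeping in mind when reading the n-gram tables in this notebook: scikit-learn's vectorizers drop stop words before they assemble n-grams, so a bigram can join two words that were not adjacent in the original description. The minimal sketch below uses a made-up sentence and a three-word stop list (both are illustrative assumptions, not data from the listings).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentence and stop list, invented for illustration only.
docs = ["coffee and tea in the kitchen"]
toy_stopwords = ["and", "in", "the"]

vectorizer = CountVectorizer(stop_words=toy_stopwords, ngram_range=(2, 2))
vectorizer.fit(docs)

# Stop words are filtered out first, leaving: coffee, tea, kitchen.
# Bigrams are then built from the surviving tokens.
bigrams = sorted(vectorizer.vocabulary_)
print(bigrams)  # ['coffee tea', 'tea kitchen']
```

This is why terms such as "coffee tea" and "kitchen living" show up in the term-frequency tables even though hosts most likely wrote "coffee and tea" or "kitchen and living room".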

## Analyzing Airbnb Descriptions - Term Frequency

In [14]:
tf = CountVectorizer(stop_words=airbnb_stopwords, 
                     max_features=10000, ngram_range=(2,3),
                     #min_df=0.02, max_df=0.98,
                     #tokenizer=tokenize_and_stem)
                    )

tf_matrix = tf.fit_transform(full_scraped_df.space)
tf_vocab = np.array(tf.get_feature_names())
tf_matrix_sum = np.sum(tf_matrix.toarray(),axis=0)
sorted_ind = np.argsort(tf_matrix_sum)[::-1]

In [15]:
nlp.get_top_term_frequency(tf=tf, df=full_scraped_df, column='space', 
                           by='price', lower_lim=100, upper_lim=9999, 
                           num_words=10)

Number of properties: 97
Top term frequency based on price, 100 > price > 9999

                          WORD, COUNT,   PCT
---------------------------------------------
                street parking,    15, 15.46%
              private bathroom,    13, 13.40%
                   living area,    13, 13.40%
              walking distance,    12, 12.37%
                    lake union,    12, 12.37%
                  capitol hill,    11, 11.34%
                    queen size,    11, 11.34%
                  private bath,     9, 9.28%
                kitchen living,     9, 9.28%
                    queen anne,     9, 9.28%
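
The helper `nlp.get_top_term_frequency` is defined outside this notebook. A rough sketch of the same idea follows, assuming the helper filters listings to a price band, counts bigrams, and reports each count as a share of the listings in the band. The function name `top_bigrams_in_price_band`, the half-open band, the once-per-listing counting, and the toy listings are all assumptions made for illustration.

```python
from collections import Counter

def top_bigrams_in_price_band(listings, lower, upper, stopwords, num_words=10):
    """Return (bigram, count, pct) rows for listings with lower <= price < upper.

    Hypothetical stand-in for the notebook's helper; each bigram is counted
    once per listing, so pct is the share of listings that mention it.
    """
    in_band = [text for price, text in listings if lower <= price < upper]
    n = len(in_band)
    if n == 0:
        return []
    counts = Counter()
    for text in in_band:
        tokens = [t for t in text.lower().split() if t not in stopwords]
        # A set keeps each bigram to one count per listing.
        counts.update({" ".join(pair) for pair in zip(tokens, tokens[1:])})
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:num_words]
    return [(term, count, 100.0 * count / n) for term, count in ranked]

# Toy (price, description) pairs, invented for illustration.
toy_listings = [
    (120, "street parking and a private bathroom"),
    (150, "private bathroom with street parking"),
    (40, "shared bathroom on the main floor"),
]
rows = top_bigrams_in_price_band(toy_listings, 100, 9999, {"and", "a", "with"})
for term, count, pct in rows:
    print("%-20s %3d %6.2f%%" % (term, count, pct))
```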


In [16]:
nlp.get_top_term_frequency(tf=tf, df=full_scraped_df, column='space', 
                           by='price', lower_lim=50, upper_lim=75, 
                           num_words=10)

Number of properties: 321
Top term frequency based on price, 50 > price > 75

                          WORD, COUNT,   PCT
---------------------------------------------
                street parking,    32, 9.97%
                    queen size,    31, 9.66%
                   easy access,    31, 9.66%
              private bathroom,    28, 8.72%
                    light rail,    27, 8.41%
                  washer dryer,    27, 8.41%
                        do not,    26, 8.10%
                  capitol hill,    25, 7.79%
                 full bathroom,    23, 7.17%
                    coffee tea,    23, 7.17%


In [17]:
nlp.get_top_term_frequency(tf=tf, df=full_scraped_df, column='space', 
                           by='price', lower_lim=0, upper_lim=50, 
                           num_words=10)

Number of properties: 216
Top term frequency based on price, 0 > price > 50

                          WORD, COUNT,   PCT
---------------------------------------------
                    main floor,    23, 10.65%
                street parking,    20, 9.26%
                    coffee tea,    19, 8.80%
              walking distance,    18, 8.33%
                     small fan,    16, 7.41%
                closet hangers,    16, 7.41%
                    queen size,    14, 6.48%
                  coffee maker,    14, 6.48%
                        do not,    13, 6.02%
                   please keep,    13, 6.02%


In [18]:
nlp.get_top_term_frequency(tf=tf, df=full_scraped_df, column='space', 
                           by='price', lower_lim=75, upper_lim=100, 
                           num_words=10)

Number of properties: 192
Top term frequency based on price, 75 > price > 100

                          WORD, COUNT,   PCT
---------------------------------------------
                street parking,    27, 14.06%
              private bathroom,    25, 13.02%
                  washer dryer,    24, 12.50%
                   puget sound,    21, 10.94%
                  coffee maker,    19, 9.90%
                   during stay,    17, 8.85%
                    queen size,    17, 8.85%
                      bus stop,    16, 8.33%
                   minute walk,    16, 8.33%
                        let me,    15, 7.81%


Based on term frequency alone, there are few insights to take away: all pricing tiers share a similar mix of top terms.

Let's look at topic modeling to get to deeper insights.

## Topic Modeling

### NMF with term frequency

In [19]:
tf_NMF = CountVectorizer(stop_words=airbnb_stopwords, max_features=10000, ngram_range=(2,2), min_df=2)
tf_NMF_matrix = tf_NMF.fit_transform(full_scraped_df.space)
tf_NMF_vocab= np.array(tf_NMF.get_feature_names())

In [25]:
nmf = NMF(n_components=50)
nmf.fit(tf_NMF_matrix.toarray())

NMF(alpha=0.0, beta=1, eta=0.1, init=None, l1_ratio=0.0, max_iter=200,
  n_components=50, nls_max_iter=2000, random_state=None, shuffle=False,
  solver='cd', sparseness=None, tol=0.0001, verbose=0)

In [26]:
tf_W = nmf.transform(tf_NMF_matrix.toarray())
tf_H = nmf.components_
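# Note: reconstruction_err_ is the Frobenius norm of the residual
# (the square root of the residual sum of squares), not the RSS itself.
print 'RSS = %.2f' % nmf.reconstruction_err_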

RSS = 126.89


In [27]:
tf_W.shape, tf_H.shape

((826, 50), (50, 8426))
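
The two shapes correspond to the factorization V ≈ WH: `tf_W` has one row per listing and one column per topic (826 x 50), and `tf_H` has one row per topic and one column per bigram (50 x 8426). A toy illustration follows, assuming a made-up 4 x 5 count matrix and two topics. The `init`, `random_state`, and `max_iter` values are set here only to make the sketch repeatable and are not the settings used in the notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term counts (4 documents x 5 terms), invented for illustration.
V = np.array([
    [3, 2, 0, 0, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 0, 4, 3],
    [0, 0, 1, 3, 4],
], dtype=float)

toy_nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)   # documents x topics
H = toy_nmf.components_        # topics x terms

print(W.shape, H.shape)
# Both factors are non-negative, and their product approximates V.
print(np.round(W.dot(H), 1))
```

Each row of W says how strongly a document expresses each topic, and each row of H says which terms define a topic, which is why the notebook sorts the rows of `tf_H` to print the top words per topic.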

In [28]:
tf_W = pd.DataFrame(tf_W,index=full_scraped_df.space)
tf_H = pd.DataFrame(tf_H,columns=tf_NMF_vocab)

#### Printing top words per topic

In [29]:
for row in xrange(tf_H.shape[0]):
    print row, ', '.join(list(tf_H.iloc[row].sort_values(ascending=False).index[:6]))
    #print (list(H.iloc[row].sort_values(ascending=False)[:10]))

0 me know, let me, if need, would like, any time, block away, peruvian breakfast
1 instruction please, do not, parking spot, etc please, please do, please read, check instruction
2 feel free, guests rooms, street parking, master bathroom, guest rooms, free ask, front door
3 hot tub, garden guesthouse, fully equipped, private garden, 15 min, jacuzzi hot, easy access
4 full sized, walking distance, good fit, 10 years, free standing, within walking, guests use
5 shared bathrooms, get up, before booking, unusual want, steep ladder, not meant, meant sleeping
6 deck view, view downtown, 3rd fl, eat bar, 11 am, full bath, full size
7 lots fun, feel free, arts district, easy access, if need, free ask, do not
8 clean sheets, extra blankets, best part, part about, couldn closer, about staying, staying experience
9 parking available, full kitchen, fully furnished, fully stocked, if looking, rooms available, minutes away
10 easy access, shared bath, across hall, bath across, access downtown, bus l


Looking through the NMF topic list above, there are indeed a few noteworthy features that have floated to the top.

- Topic 6 seems to be about scenic views. 
- Topic 3 seems to highlight luxurious features, such as hot tubs and private gardens.
- Topic 21 seems to highlight comfort features. 
- Topic 49 seems to be about televisions.

Let's see the weights of each of these topics for different properties. Perhaps the higher priced properties exhibit higher weights for luxurious features and views? 
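
`nlp.get_topic_weights_prop_range` is also defined outside this notebook. One plausible reading of its output, offered here as an assumption, is: average each topic's weight over the listings in the price band, sort topics by that average, and keep a running total. The function name `mean_topic_weights`, the half-open band, and the toy numbers are hypothetical.

```python
import numpy as np

def mean_topic_weights(W, prices, lower, upper):
    """Rank topics by mean weight over listings with lower <= price < upper.

    Hypothetical sketch; returns (topic, mean_weight, cumulative_weight) rows.
    """
    W = np.asarray(W, dtype=float)
    prices = np.asarray(prices, dtype=float)
    mask = (prices >= lower) & (prices < upper)
    if not mask.any():
        return []
    mean_wt = W[mask].mean(axis=0)
    order = np.argsort(mean_wt)[::-1]          # heaviest topic first
    cumulative = np.cumsum(mean_wt[order])
    return [(int(t), float(mean_wt[t]), float(c))
            for t, c in zip(order, cumulative)]

# Toy weights for 4 listings x 3 topics, invented for illustration.
toy_W = [[0.5, 0.0, 0.1],
         [0.3, 0.1, 0.1],
         [0.0, 0.4, 0.2],
         [0.0, 0.6, 0.0]]
toy_prices = [150, 120, 40, 30]

ranked = mean_topic_weights(toy_W, toy_prices, 100, 9999)
for topic, wt, sum_wt in ranked:
    print("%5d, %.3f, %.3f" % (topic, wt, sum_wt))
```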

**What are the top topics for properties with price > 100?**

In [30]:
nlp.get_topic_weights_prop_range(full_scraped_df, 'price', tf_W, tf_H, 100, 9999)

n =  80
   TOPIC,    WT, SUMWT, WORDS
       6, 0.109, 0.109, deck view, view downtown, 3rd fl, eat bar, 11 am
      12, 0.058, 0.166, lake union, living area, walking distance, kitchen living, bathroom kitchen
      47, 0.050, 0.217, capitol hill, full size, university washington, basement that, all amenities
       3, 0.046, 0.262, hot tub, garden guesthouse, fully equipped, private garden, 15 min
      33, 0.043, 0.306, screen tv, first floor, flat screen, 15 minute, alki beach
       2, 0.043, 0.348, feel free, guests rooms, street parking, master bathroom, guest rooms
      45, 0.038, 0.386, street parking, off street, main floor, washer dryer, minute walk
      42, 0.025, 0.411, second floor, first floor, kitchenette bathroom, washer dryer, free use
      32, 0.024, 0.435, shampoo conditioner, queen size, body wash, hair dryer, conditioner body
      39, 0.021, 0.455, light rail, blocks away, rail station, lake washington, minutes away
      29, 0.015, 0.471, queen anne, minute w

**What are the top topics for properties with 50 < price < 100?**

In [32]:
nlp.get_topic_weights_prop_range(full_scraped_df, 'price', tf_W, tf_H, 50, 100)

n =  467
   TOPIC,    WT, SUMWT, WORDS
      45, 0.030, 0.030, street parking, off street, main floor, washer dryer, minute walk
      42, 0.030, 0.060, second floor, first floor, kitchenette bathroom, washer dryer, free use
      32, 0.026, 0.086, shampoo conditioner, queen size, body wash, hair dryer, conditioner body
       9, 0.022, 0.108, parking available, full kitchen, fully furnished, fully stocked, if looking
      39, 0.022, 0.130, light rail, blocks away, rail station, lake washington, minutes away
      10, 0.022, 0.151, easy access, shared bath, across hall, bath across, access downtown
      21, 0.019, 0.170, memory foam, walking distance, foam mattress, coffee tea, within walking
      48, 0.018, 0.189, main floor, private suite, five blocks, blocks greenlake, other adventures
       2, 0.018, 0.207, feel free, guests rooms, street parking, master bathroom, guest rooms
       7, 0.018, 0.225, lots fun, feel free, arts district, easy access, if need
       5, 0.017, 0.242

**What are the top topics for properties with price < 50?**

In [31]:
nlp.get_topic_weights_prop_range(full_scraped_df, 'price', tf_W, tf_H, 0, 50)

n =  216
   TOPIC,    WT, SUMWT, WORDS
       8, 0.047, 0.047, clean sheets, extra blankets, best part, part about, couldn closer
       1, 0.042, 0.089, instruction please, do not, parking spot, etc please, please do
      18, 0.041, 0.130, small fan, rooms that, that share, closet hangers, please keep
      16, 0.032, 0.162, brand new, if need, 2nd floor, feel free, master suite
      20, 0.031, 0.193, mini refrigerator, coffee maker, does not, other bathroom, screen tv
      44, 0.029, 0.221, three rooms, me know, let me, please let, know if
      35, 0.027, 0.248, cool summer, double closet, winter cool, fruit trees, apple orchard
      11, 0.027, 0.275, minute walk, if desired, one block, bus lines, ballard locks
      24, 0.026, 0.300, seed milk, if need, one own, year round, bring own
       3, 0.019, 0.319, hot tub, garden guesthouse, fully equipped, private garden, 15 min
      17, 0.018, 0.338, if she, minutes south, walking distance, within walking, high speed
      13, 0.01

### NMF with tf-idf

In [33]:
tfidf_NMF = TfidfVectorizer(stop_words=airbnb_stopwords, max_features=10000, ngram_range=(2,3), min_df=2)
tfidf_NMF_matrix = tfidf_NMF.fit_transform(full_scraped_df.space)
tfidf_NMF_vocab= np.array(tfidf_NMF.get_feature_names())
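
For reference, `TfidfVectorizer` with its default settings weights a term by its count multiplied by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scales each document vector to unit L2 length. The sketch below reproduces only the idf part in plain Python on made-up document frequencies, to show why bigrams that appear in nearly every listing contribute less than rare ones. The helper name `smoothed_idf` is hypothetical.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf, matching scikit-learn's default TfidfVectorizer settings."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

n_docs = 826  # number of listings in the matrices factorized in this notebook
for label, df in [("appears in every listing", 826),
                  ("appears in 100 listings", 100),
                  ("appears in 2 listings", 2)]:
    print("%-26s idf = %.2f" % (label, smoothed_idf(n_docs, df)))
```

Because frequent bigrams are down-weighted before factorization, the tf-idf topics tend to be driven by more distinctive phrases than the raw-count topics.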

In [34]:
nmf_tfidf = NMF(n_components=50)
nmf_tfidf.fit(tfidf_NMF_matrix.toarray())

NMF(alpha=0.0, beta=1, eta=0.1, init=None, l1_ratio=0.0, max_iter=200,
  n_components=50, nls_max_iter=2000, random_state=None, shuffle=False,
  solver='cd', sparseness=None, tol=0.0001, verbose=0)

In [35]:
tfidf_W = nmf_tfidf.transform(tfidf_NMF_matrix.toarray())
tfidf_H = nmf_tfidf.components_
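# Note: reconstruction_err_ is the Frobenius norm of the residual
# (the square root of the residual sum of squares), not the RSS itself.
print 'RSS = %.2f' % nmf_tfidf.reconstruction_err_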

RSS = 25.67


In [36]:
tfidf_W = pd.DataFrame(tfidf_W,index=full_scraped_df.space)
tfidf_H = pd.DataFrame(tfidf_H,columns=tfidf_NMF_vocab)

#### Printing top words per topic

In [38]:
for row in xrange(tfidf_H.shape[0]):
    print "Topic", row, ":", ', '.join(list(tfidf_H.iloc[row].sort_values(ascending=False).index[:4]))
    #print (list(H.iloc[row].sort_values(ascending=False)[:10]))

Topic 0 : private bathroom, washer dryer, bathroom queen, private bathroom queen
Topic 1 : that share bathrooms, mind also please, full bathrooms guest, bathrooms guest
Topic 2 : instruction please, etc please do, please do not, parking spot
Topic 3 : currently not, perfect spot, refrigerator dishwasher microwave, freeway supermarkets
Topic 4 : easy access, shared bath, shared bath across, bath across
Topic 5 : paint new, find availability, machine dedicate whole, machine dedicate
Topic 6 : puget sound, olympic mountains, queen anne, views puget sound
Topic 7 : booked first, right through, same hallway booked, upon entering
Topic 8 : three rooms, may get, three rooms so, seek if
Topic 9 : clean sheets, part about, best part about, best part
Topic 10 : view downtown, deck view, 3rd fl, eat bar
Topic 11 : off street, off street parking, street parking, street parking available
Topic 12 : near convenient transportation, all comfort located, near convenient, comfort located outside
Topic 1

**What are the top topics for properties with price > 100?**

In [42]:
nlp.get_topic_weights_prop_range(full_scraped_df, 'price', tfidf_W, tfidf_H, 100, 9999, num_words=4)

n =  80
   TOPIC,    WT, SUMWT, WORDS
      10, 0.015, 0.015, view downtown, deck view, 3rd fl, eat bar
       0, 0.014, 0.029, private bathroom, washer dryer, bathroom queen, private bathroom queen
      39, 0.013, 0.042, own private, well lit, private full, private entrance
      43, 0.012, 0.054, equipped kitchen, fully equipped, fully equipped kitchen, wireless internet access
      35, 0.011, 0.065, free street parking, free street, street parking, residential street
      17, 0.009, 0.074, feel free, feel free ask, free ask, master bathroom
      25, 0.008, 0.082, first floor, second floor, floor features, floor three
      36, 0.008, 0.090, main floor, floor guests, bedrooms main, bedrooms main floor
      19, 0.007, 0.096, capitol hill, international district, first hill, downtown capitol
      33, 0.007, 0.103, light rail, lake washington, easy access, view lake
      30, 0.006, 0.109, flat screen, screen tv, flat screen tv, relaxing stay
      20, 0.006, 0.115, natural light,

**What are the top topics for properties with 50 < price < 100?**

In [43]:
nlp.get_topic_weights_prop_range(full_scraped_df, 'price', tfidf_W, tfidf_H, 50, 100, num_words=4)

n =  467
   TOPIC,    WT, SUMWT, WORDS
      39, 0.012, 0.012, own private, well lit, private full, private entrance
       4, 0.011, 0.023, easy access, shared bath, shared bath across, bath across
       3, 0.008, 0.031, currently not, perfect spot, refrigerator dishwasher microwave, freeway supermarkets
      33, 0.008, 0.039, light rail, lake washington, easy access, view lake
      13, 0.007, 0.046, shampoo conditioner body, conditioner body, conditioner body wash, shampoo conditioner
      18, 0.007, 0.053, memory foam, memory foam mattress, foam mattress, new memory foam
       0, 0.007, 0.060, private bathroom, washer dryer, bathroom queen, private bathroom queen
      46, 0.007, 0.067, blocks away, back yard, few blocks, few blocks away
      35, 0.006, 0.073, free street parking, free street, street parking, residential street
       2, 0.006, 0.080, instruction please, etc please do, please do not, parking spot
      30, 0.006, 0.086, flat screen, screen tv, flat screen tv, 

**What are the top topics for properties with price < 50?**

In [44]:
nlp.get_topic_weights_prop_range(full_scraped_df, 'price', tfidf_W, tfidf_H, 0, 50, num_words=4)

n =  216
   TOPIC,    WT, SUMWT, WORDS
       2, 0.016, 0.016, instruction please, etc please do, please do not, parking spot
       9, 0.015, 0.031, clean sheets, part about, best part about, best part
       7, 0.011, 0.042, booked first, right through, same hallway booked, upon entering
      21, 0.011, 0.053, small fan, that mind please, please also turn, please also
      36, 0.011, 0.064, main floor, floor guests, bedrooms main, bedrooms main floor
       5, 0.011, 0.075, paint new, find availability, machine dedicate whole, machine dedicate
       1, 0.009, 0.084, that share bathrooms, mind also please, full bathrooms guest, bathrooms guest
      49, 0.009, 0.093, owner lives, light view, door lock, curtain privacy
      16, 0.009, 0.102, cool summer, winter cool, winter cool summer, double closet
      37, 0.008, 0.111, full size, study desk, medium sized, sized 10
      29, 0.008, 0.118, walking distance, within walking distance, within walking, stores within
      46, 0.007, 

## Observations

NMF topic modeling with tf-idf matrices reveals interesting traits about different property types.

For properties priced > 100, the top 5 topics are:
1. View
2. Private bathroom
3. Private Entrance
4. Kitchen
5. Parking

For properties priced < 50, the top 5 topics are:
1. Instructions
2. Clean sheets
3. Hallway
4. Instructions
5. Floor layout

The data suggests that properties that offer views, private bathrooms, and private entrances are often priced higher.  This aligns with our intuition.


## Next steps

Include topics of interest as features in the regression model!
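
One way this could look, as a sketch: pick the columns of the topic-weight matrix that correspond to the topics of interest and append them to the existing feature matrix. The function name `add_topic_features`, the toy values, and the choice of topics 10 and 39 (the "view" and "private entrance" topics from the tf-idf run) are illustrative assumptions. Topic numbers may change between NMF runs, so the selected indices should be re-checked after refitting.

```python
import numpy as np

def add_topic_features(X, W, topic_ids):
    """Append the selected topic-weight columns of W to feature matrix X."""
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    if X.shape[0] != W.shape[0]:
        raise ValueError("X and W must describe the same listings, row for row")
    return np.hstack([X, W[:, topic_ids]])

# Toy data: 3 listings with 2 existing features and 50 topic weights each.
rng = np.random.RandomState(0)
toy_X = np.array([[1, 2], [2, 1], [3, 2]], dtype=float)  # hypothetical features
toy_W = rng.rand(3, 50)                                  # stand-in for NMF weights

X_with_topics = add_topic_features(toy_X, toy_W, [10, 39])
print(X_with_topics.shape)  # (3, 4)
```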