## TF.IDF and related algorithms ##  

- Find the words with the highest $tf * idf$ score in each review.  
- Apply SVD to the document-term matrix, whose entries are tf.idf scores.  
- Non-negative Matrix Factorization (NMF), to find representative topics.  
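
As a rough illustration of the first item, the sketch below scores words with the textbook definition (term frequency times $\log(N / df)$) and keeps the highest-scoring words per review. The function name `top_tfidf_words` and the toy reviews are made up for this example. scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2 normalization by default, so its scores will differ from these.

```python
import math
from collections import Counter

def top_tfidf_words(docs, n_top=2):
    """Return the n_top highest-scoring words for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        # Textbook tf * idf: relative frequency times log(N / df).
        scores = {
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result.append(ranked[:n_top])
    return result

# Hypothetical toy reviews, already tokenized.
toy_reviews = [
    "great pool great staff".split(),
    "noisy room rude staff".split(),
    "great room quiet street".split(),
]
for words in top_tfidf_words(toy_reviews):
    print(words)
```

In the first toy review, "pool" outranks "great" even though "great" occurs twice, because "great" also appears in another review and so has a lower idf.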


In [50]:
import tensorflow as tf
import numpy as np
import scipy.sparse.linalg  # explicit submodule import; svds is used below
import pandas as pd
import json
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from nltk.corpus import stopwords
from time import time

In [2]:
city_desc = pd.read_csv("../data/hotels/data/" + "beijing" + ".csv")
for index, row in city_desc.iterrows():
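    # .iloc[0] selects the first column by position; row[0] on a label-indexed Series is deprecated.
    print ("index is {}, row[:,0] is {}".format(index, row.iloc[0]))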

index is china_beijing_holiday_inn_central_plaza, row[:,0] is holiday inn central plaza
index is china_beijing_hilton_beijing_wangfujing, row[:,0] is hilton beijing wangfujing
index is china_beijing_hotel_g, row[:,0] is hotel g
index is china_beijing_the_regent_beijing, row[:,0] is the regent beijing
index is china_beijing_the_st_regis_beijing, row[:,0] is the st regis beijing
index is china_beijing_park_plaza_beijing_wangfujing, row[:,0] is park plaza beijing wangfujing
index is china_beijing_the_ritz_carlton_huamao_center, row[:,0] is the ritz carlton huamao center
index is china_beijing_the_opposite_house, row[:,0] is the opposite house
index is china_beijing_double_happiness_courtyard_hotel, row[:,0] is double happiness courtyard hotel
index is china_beijing_intercontinental_financial_street_beijing, row[:,0] is intercontinental financial street beijing
index is china_beijing_michael_s_house_in_beijing, row[:,0] is michael s house in beijing
index is china_beijing_spring_garden_cou

In [10]:
with open("../data/hotels/data/beijing/china_beijing_tianlun_dynasty_hotel") as f:
    s = f.read()
sl = s.split('\t')
i = 0
while (i < len(sl) - 2):

    timestamp = sl[i].strip("\n")
    title = sl[i+1].strip("\n")
    review = sl[i+2].strip("\n")
    print ("{}: {}".format(i, timestamp))
    print ("--------------------------")
    print ("{}: {}".format(i+1, title))
    print ("--------------------------")
    print ("{}: {}".format(i+2, review))
    print ("--------------------------")
    i += 3

0: Nov 19 2009 
--------------------------
1: Sunworld Dynasty Hotel - Better than expected
--------------------------
2: At the last minute, we booked Sunworld Dynasty hotel through the Agoda website. We stayed for five nights and the service we received is on par with standards of a five star foreign-own hotels (e.g. Sheraton). Most staff could communicate in English, which was helpful. (Not a lot of locals in the city can speak/understand English so it was nice to be able to speak our first language at the end of the day). Complimentary bottles of water were provided every day. We had a fair size room with a king bed. Bathroom was also a nice size with a shower and a bathtub. Turn down service was a nice touch. Note that there is only one non-smoking floor. Gym and pool facilities were also very good.
--------------------------
3: Oct 29 2009 
--------------------------
4: Crappy hotel
--------------------------
5: It is now known as the Sunworld Dynasty Hotel. rated itself as 5 sta

In [11]:
with open("../data/hotels/data/beijing/china_beijing_traders_hotel") as f:
    s = f.read()
sl = s.split('\t')
i = 0
while (i < len(sl) - 2):

    timestamp = sl[i].strip("\n")
    title = sl[i+1].strip("\n")
    review = sl[i+2].strip("\n")
    print ("{}: {}".format(i, timestamp))
    print ("--------------------------")
    print ("{}: {}".format(i+1, title))
    print ("--------------------------")
    print ("{}: {}".format(i+2, review))
    print ("--------------------------")
    i += 3

0: Nov 13 2009 
--------------------------
1: Reasonable business hotel
--------------------------
2: I have stayed at this hotel for about half a dozen times during the past 2 - 3 years. I have always been satisfied with them.The rooms are not large, but there's enough room for the bed and working desk. They have always been clean and perfectly functional.This time, somehow the staff at the reception was not too friendly. The staff at the concierge desk, the doormen and the elevator guards were very polite as always.The location is very good for shopping and sightseeing.This hotel is of good value for money at least with our company rate.
--------------------------
3: Aug 27 2009 
--------------------------
4: Old weary and very overpiced... oh and badly managed!
--------------------------
5: 4 years ago I checked into the Traders for what I swore would be the last time. Upon arrival they said - oh we made a mistake re your booking, please wait in the louge while we fix it... 1.5 hour

In [61]:
## Process files: convert them into one file: list of comments
cities = ['beijing', 'chicago', 'dubai', 'las-vegas', 'london', 'montreal', 'new-delhi',
          'new-york-city', 'san-francisco', 'shanghai']
comments = [] # list of tuples: (timestamp, title, review)
ignored_insufficient_comment_length = 0
taken_comment = 0
FnFerror = 0
unicode_errors = 0
hotel_number = 0
total_comment = 0
for city_name in cities:
    city_desc = pd.read_csv("../data/hotels/data/" + city_name + ".csv")
    for hotel_name, _ in city_desc.iterrows():
        filename = "../data/hotels/data/{}/{}".format(city_name, hotel_name)
        hotel_number += 1
        try:
            with open(filename, "r") as f:
                sl = f.read().split('\t')
            i = 0
            while (i < len(sl)-2):
                total_comment += 1
                timestamp = sl[i].strip("\n")
                title = sl[i+1].strip("\n")
                review = sl[i+2].strip("\n")
                comment = (timestamp, title, review)
                i += 3
                if (len(review) < 20):
                    ignored_insufficient_comment_length += 1
                else:
                    comments.append(comment)
                    taken_comment += 1
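        except (UnicodeDecodeError, FileNotFoundError) as e:
            # isinstance is required here: `e is FileNotFoundError` compares
            # an exception instance with the class and is always False.
            if isinstance(e, FileNotFoundError):
                FnFerror += 1
            else:
                unicode_errors += 1
            continue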
            
            
print ("Data read done! Out of {} comments from {} hotels, \
taken {}, ignored {} due to insufficient comment length, \
skipped {} due to file not found errors, and {} due to UnicodeDecodeError"
       .format(total_comment, hotel_number, taken_comment, ignored_insufficient_comment_length, FnFerror, unicode_errors))
with open("../data/comments-LoS.json", "w") as f:
    f.write(json.dumps(comments))

print ("len(comments) = {}".format(len(comments)))
print ("One comment a line:")
for i in range(5):
    print (comments[i])

Data read done! Out of 3513 comments from 3105 hotels, taken 3257, ignored 256 due to insufficient comment length, skipped 0 due to file not found errors, and 2827 due to UnicodeDecodeError
len(comments) = 3257
One comment a line:
('Oct 29 2009 ', 'Highly recommended - what a great place!', "On the recommendation of a friend, I stayed at Kelly's Courtyard for two nights after a business trip to Zhengzhou.The place is quiet, clean and comfortable. Comfortable rooms, great staff, a nice rooftop terrace for relaxing, and free wireless. Though a little hard to find for the uninitiated - have the Chinese address written out and ask locals if you are having trouble - it's well worth seeking out. When I return to Beijing, I will return to Kelly's Courtyard.")
('Oct 18 2009 ', 'Small ', "We checked in at Kelly's courtyard in the afternoon of 16 Oct. The location is a bit hard to find but once we got inside the place we felt it was worth it. This 9-rooms accommodation is tastefully decorated, q

In [None]:
# I believe some cities contain many more hotel reviews than are currently read.
# TODO - print the number of reviews read from each city!
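
One possible way to approach this TODO is sketched below. It reuses the tab-separated (timestamp, title, review) layout and the 20-character minimum from the loading cell above. The names `parse_reviews`, `count_reviews_per_city`, and the `sample` data are hypothetical. In the real notebook the texts would come from the per-hotel files.

```python
from collections import Counter

def parse_reviews(text):
    """Split tab-separated text into (timestamp, title, review) triples."""
    fields = text.split("\t")
    return [
        tuple(field.strip("\n") for field in fields[i:i + 3])
        for i in range(0, len(fields) - 2, 3)
    ]

def count_reviews_per_city(texts_by_city, min_length=20):
    """Count reviews of at least min_length characters for each city."""
    counts = Counter()
    for city, texts in texts_by_city.items():
        for text in texts:
            counts[city] += sum(
                1 for _, _, review in parse_reviews(text)
                if len(review) >= min_length
            )
    return counts

# Toy stand-in for the contents of the per-hotel files (hypothetical data).
sample = {
    "beijing": ["Nov 19 2009\tGood\tA clean and quiet hotel near the subway.\t"
                "Oct 29 2009\tBad\tToo short\t"],
    "london": ["Aug 27 2009\tFine\tFriendly staff and a generous breakfast.\t"],
}
print(count_reviews_per_city(sample))
```

Keeping the count keyed by city makes it easy to spot a city whose files mostly failed to load.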

In [26]:
with open("../data/comments-LoS.json") as f:
    comments = json.loads(f.read())

def reviews2vec():
    corpus = [item[2] for item in comments]
    vectorizer = TfidfVectorizer(min_df = 2)
    return vectorizer.fit_transform(corpus)
    
print ("Number of documents: {}".format(len(comments)))

A = reviews2vec()
#print ("A is {}".format(A))
# Rows are documents and columns are vocabulary terms:
# A[doc_id, word_id] = tf.idf weight of that word in that document

A

Number of documents: 3257


<3257x8649 sparse matrix of type '<class 'numpy.float64'>'
	with 316107 stored elements in Compressed Sparse Row format>

In [27]:
# k=500 matches the shapes printed below; the third return value is V transposed.
U, S, V = scipy.sparse.linalg.svds(A, k=500)
print ("U.shape={}".format(U.shape))
print ("S.shape={}".format(S.shape))
print ("V.shape={}".format(V.shape))

U.shape=(3257, 500)
S.shape=(500,)
V.shape=(500, 8649)


In [28]:
# TODO - how to get topic model from SVD? Not quite obvious... Let's try LDA first.
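
One common reading of the SVD (latent semantic analysis) treats each row of the third factor, `V` above, as a direction in term space, and the terms with the largest absolute loadings on that row as a rough "topic". The sketch below tries this on a tiny dense matrix with `numpy.linalg.svd`. The function name `svd_top_terms`, the vocabulary, and the matrix are invented for illustration. With `scipy.sparse.linalg.svds` the order of the returned singular values should not be relied on, so the components would need sorting by `S` first.

```python
import numpy as np

def svd_top_terms(doc_term, vocab, n_topics=2, n_top=2):
    """List the terms with the largest absolute loading on each leading singular direction."""
    # Rows of vt are directions in term space, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(doc_term, full_matrices=False)
    topics = []
    for component in vt[:n_topics]:
        # The sign of a singular vector is arbitrary, so rank by magnitude.
        top = np.argsort(-np.abs(component))[:n_top]
        topics.append([vocab[i] for i in top])
    return topics

# Hypothetical document-term matrix: two reviews about leisure, two about transport.
vocab = ["pool", "spa", "subway", "taxi"]
doc_term = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])
for topic in svd_top_terms(doc_term, vocab):
    print(topic)
```

Unlike LDA or NMF, the components can mix positive and negative loadings, which is part of why they are harder to read as topics.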

In [48]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [54]:
with open("../data/comments-LoS.json") as f:
    comments = json.loads(f.read())
    
def LDAexample():
    corpus = [item[2] for item in comments]
    print ("Finished reading corpus! Number of reviews={}".format(len(corpus)))
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=1000,
                                   stop_words=stopwords.words('english'))
    print ("Start fitting tfidf transformations!")
    t0 = time()
    # Named tfidf rather than tf so the tensorflow import is not shadowed.
    tfidf = tfidf_vectorizer.fit_transform(corpus)
    print("done in %0.3fs." % (time() - t0))
    # get_feature_names() was removed in scikit-learn 1.2.
    tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

    # Note: LDA models word counts, so CountVectorizer output is the more
    # usual input; tf.idf weights are kept here as in the recorded run.
    # n_topics was renamed to n_components in scikit-learn 0.19.
    lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
    print ("Start fitting LDA model!")
    t0 = time()
    lda.fit(tfidf)
    print("done in %0.3fs." % (time() - t0))

    n_top_words = 10
    print_top_words(lda, tfidf_feature_names, n_top_words)
    
LDAexample()

Finished reading corpus! Number of reviews=3257
Start fitting tfidf transformations!
done in 0.384s.
Start fitting LDA model!
done in 1.295s.
Topic #0:
hotel english service beijing spoke staff stay daily rooms dining
Topic #1:
hotel beijing star service subway room staff us everything great
Topic #2:
laurel beijing hotel decorated food good beautiful music courtyard walk
Topic #3:
hotel beijing great room pool good really spacious city spa
Topic #4:
hotel good beijing facilities room staff breakfast buffet great main
Topic #5:
ritz taxi location would hotel subway beijing town decor views
Topic #6:
hotel room stay good staff quot rooms great would place
Topic #7:
beijing hotel great highly vacation brand first good nearby well
Topic #8:
hotel beijing staff us room quite excellent great really wall
Topic #9:
hotel chinese perfect english beijing first offered quite staff everything
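
NMF is listed in the outline of this section and imported next to `LatentDirichletAllocation`, but no cell runs it. The sketch below is a from-scratch illustration of the idea using the well-known multiplicative update rules for the Frobenius objective. It is not scikit-learn's solver. The function name `nmf_multiplicative`, the toy matrix, and the vocabulary are assumptions for this example. In practice the imported `NMF` class would be the natural choice, and since a fitted model exposes `components_`, `print_top_words` should work with it unchanged.

```python
import numpy as np

def nmf_multiplicative(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (docs x terms) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_topics)) + eps
    H = rng.random((n_topics, n_terms)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical document-term matrix with two loose groups of terms.
vocab = ["pool", "spa", "subway", "taxi"]
V = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])
W, H = nmf_multiplicative(V, n_topics=2)
for topic in H:
    # Highest-weight terms for this topic row.
    print([vocab[i] for i in np.argsort(topic)[::-1][:2]])
print("reconstruction error:", np.linalg.norm(V - W @ H))
```

Each row of `H` plays the role that a row of `model.components_` plays in `print_top_words`, and each row of `W` gives the topic weights of one document.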

