## TF.IDF and related algorithms ##  

- Find the words with the highest $tf \cdot idf$ score in each review.
- Apply SVD to the document-term matrix whose entries are tf.idf scores.
- Apply Non-negative Matrix Factorization (NMF) to the same matrix to find representative topics.


In [6]:
import tensorflow as tf
import numpy as np
import pandas as pd
import json
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords


In [2]:
city_desc = pd.read_csv("../data/hotels/data/" + "beijing" + ".csv")
for index, row in city_desc.iterrows():
    # Use .iloc for positional access; row[0] on a label-indexed Series is deprecated.
    print("index is {}, row[:,0] is {}".format(index, row.iloc[0]))

index is china_beijing_holiday_inn_central_plaza, row[:,0] is holiday inn central plaza
index is china_beijing_hilton_beijing_wangfujing, row[:,0] is hilton beijing wangfujing
index is china_beijing_hotel_g, row[:,0] is hotel g
index is china_beijing_the_regent_beijing, row[:,0] is the regent beijing
index is china_beijing_the_st_regis_beijing, row[:,0] is the st regis beijing
index is china_beijing_park_plaza_beijing_wangfujing, row[:,0] is park plaza beijing wangfujing
index is china_beijing_the_ritz_carlton_huamao_center, row[:,0] is the ritz carlton huamao center
index is china_beijing_the_opposite_house, row[:,0] is the opposite house
index is china_beijing_double_happiness_courtyard_hotel, row[:,0] is double happiness courtyard hotel
index is china_beijing_intercontinental_financial_street_beijing, row[:,0] is intercontinental financial street beijing
index is china_beijing_michael_s_house_in_beijing, row[:,0] is michael s house in beijing
index is china_beijing_spring_garden_cou

In [32]:
# Read one hotel file; fields are tab-separated and repeat as (date, title, review).
with open("../data/hotels/data/beijing/china_beijing_kellys_courtyard") as f:
    s = f.read()
sl = s.split('\t')
for i, field in enumerate(sl):
    item = field.strip("\n")
    print("{}: {}".format(i, item))
    print("--------------------------")

0: Oct 29 2009 
--------------------------
1: Highly recommended - what a great place!
--------------------------
2: On the recommendation of a friend, I stayed at Kelly's Courtyard for two nights after a business trip to Zhengzhou.The place is quiet, clean and comfortable. Comfortable rooms, great staff, a nice rooftop terrace for relaxing, and free wireless. Though a little hard to find for the uninitiated - have the Chinese address written out and ask locals if you are having trouble - it's well worth seeking out. When I return to Beijing, I will return to Kelly's Courtyard.
--------------------------
3: Oct 18 2009 
--------------------------
4: Small 
--------------------------
5: We checked in at Kelly's courtyard in the afternoon of 16 Oct. The location is a bit hard to find but once we got inside the place we felt it was worth it. This 9-rooms accommodation is tastefully decorated, quiet and really cozy. They put jazzy ambient music on all day - which is lovely. For the bedroom

In [33]:
# Same inspection for a second hotel file.
with open("../data/hotels/data/beijing/china_beijing_the_ritz_carlton_beijing_financial_street") as f:
    s = f.read()
sl = s.split('\t')
for i, field in enumerate(sl):
    item = field.strip("\n")
    print("{}: {}".format(i, item))
    print("--------------------------")

0: Nov 1 2009 
--------------------------
1: Very impressive
--------------------------
2: First time to Beijing. This hotel looks brand new. The staff is superb - very helpful in anything that they were asked to do. The rooms are fantastic. I was only there a few days, but this hotel made a very good base of operations - highly, highly recommended.
--------------------------
3: Sep 27 2009 
--------------------------
4: Amazing! The best hotel i ever stayed
--------------------------
5: This hotel is the perfect definition of what you look for in an hotel! The rooms are huge full of special details, everything that you need is in that room! The bathroom is big with bathtube and even TV! No words, there is even a night light! The spa is great and if you just want to chill a bit enjoy the swimming pool! The breakfast is amazing! Since Beijing is enormous, the location is perfect, 15m by car from the city centre! Only a detail could be better, wifi should be free of charges!
------------

In [31]:
## Process the per-hotel files into a single list of comments
cities = ['beijing', 'chicago', 'dubai', 'las-vegas', 'london', 'montreal', 'new-delhi',
          'new-york-city', 'san-francisco', 'shanghai']
comments = []  # list of tuples: (timestamp, title, review)
for city_name in cities:
    city_desc = pd.read_csv("../data/hotels/data/" + city_name + ".csv")
    for hotel_name, _ in city_desc.iterrows():
        filename = "../data/hotels/data/{}/{}".format(city_name, hotel_name)
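        try:
            # Files that are missing or contain non-ASCII bytes are skipped.
            with open(filename, "r", encoding="ascii") as f:
                sl = f.read().split('\t')
        except (UnicodeDecodeError, FileNotFoundError):
            continue
        # Fields repeat as (timestamp, title, review); step through them in threes.
        for i in range(0, len(sl) - 2, 3):
            timestamp = sl[i].strip("\n")
            title = sl[i+1].strip("\n")
            review = sl[i+2].strip("\n")
            # Flag suspiciously short reviews. Testing len(comment) instead would
            # always fire, because a 3-tuple has length 3 whatever its text.
            if len(review) < 20:
                print(filename)
            # Append inside the loop so every review is kept, not only the
            # last one read from each file.
            comments.append((timestamp, title, review))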
            
with open("../data/comments-LoS.json", "w") as f:
    f.write(json.dumps(comments))

print ("len(comments) = {}".format(len(comments)))
print ("One comment a line:")
for i in range(5):
    print (comments[i])

../data/hotels/data/beijing/china_beijing_kellys_courtyard
../data/hotels/data/beijing/china_beijing_kellys_courtyard
../data/hotels/data/beijing/china_beijing_kellys_courtyard
../data/hotels/data/beijing/china_beijing_kellys_courtyard
../data/hotels/data/beijing/china_beijing_kellys_courtyard
../data/hotels/data/beijing/china_beijing_kellys_courtyard
../data/hotels/data/beijing/china_beijing_kellys_courtyard
../data/hotels/data/beijing/china_beijing_kellys_courtyard
../data/hotels/data/beijing/china_beijing_the_ritz_carlton_beijing_financial_street
../data/hotels/data/beijing/china_beijing_the_ritz_carlton_beijing_financial_street
../data/hotels/data/beijing/china_beijing_the_ritz_carlton_beijing_financial_street
../data/hotels/data/beijing/china_beijing_the_ritz_carlton_beijing_financial_street
../data/hotels/data/beijing/china_beijing_the_ritz_carlton_beijing_financial_street
../data/hotels/data/beijing/china_beijing_the_ritz_carlton_beijing_financial_street
../data/hotels/data/beij

In [None]:
def reviews2vec():