# Generating topics for venue tips in London
* prepare tips data
* preprocessing tips
* generating topics
* evaluation

### 1.prepare tips data
We collect tips from Foursquare. 

There are around 11000 places and about 1700 places have more 200 tips.
All of the tips(English tips) were collected previously. 

In [247]:
# import useful libs
import graphlab as gl
import pandas as pd
import json
import re

In [248]:
# combine two json files: one for venue info, one for tips
file_path = './cleaned_london_venues_data_twitter.json'
json_file = open(file_path)
jsonObj = json.load(json_file)

file_path = './london_tips.json'
json_file = open(file_path)
tips = json.load(json_file)

for id in jsonObj.keys():
    jsonObj[id]['tips'] = tips[id]
df = pd.DataFrame.from_dict(jsonObj,orient='index')

In [249]:
# export it to csv for later use
df['venue_id'] = df.index
sf = gl.SFrame(df)
sf.export_csv('london_venue.csv')

In [254]:
# Sframe shown
sf.head()

category,city,rating,name,tags,country,likes
Burgers,London,8.3,Byron,[burgers],GB,150.0
Hotel,London,9.0,Haymarket Hotel,[],GB,32.0
Palace,London,9.0,Buckingham Palace,"[cultural, guards, palace, queen, royalty, ...",GB,5469.0
Hotel,London,7.2,Macdonald Hotel,[],GB,4.0
Portuguese,London,8.2,O Cantinho de Portugal,[],GB,31.0
Zoo,London,8.9,ZSL London Zoo,"[animals, aquarium, birds, lions, meerkats, ...",GB,823.0
Art Gallery,London,9.5,Tate Britain,"[art, art gallery, gallery, louis vuitton, ...",GB,1281.0
Hotel,London,8.3,The Cavendish London,"[accommodation, 4-star, meeting, lunch, hotel] ...",GB,60.0
Hotel,London,9.2,The Park Tower Knightsbridge ...,"[central london, luxury hotel, spg, starwood] ...",GB,313.0
Hotel,London,8.9,The Halkin by COMO,[],GB,30.0

photo,tips,checkins,description
[https://igx.4sqi.net/img /general/800x600 ...,[Delicious burger ! I can say the bun is the key ...,2419,We make proper hamburgers. Good Scot ...
[https://igx.4sqi.net/img /general/800x600/8C5- ...,[Check out the indoor pool. Now we are ...,1510,
[https://igx.4sqi.net/img /general/800x600/6837 ...,[Skip the irrelevant change of the guards and ...,92542,Buckingham Palace is the working headquarters of ...
[https://igx.4sqi.net/img /general/800x600/6843 ...,"[Amazing English BreakFast., Definitely ...",303,
[https://igx.4sqi.net/img /general/800x600/TTnC ...,"[My favourite portuguese food in London, Relax ...",365,
[https://igx.4sqi.net/img /general/800x600/2513 ...,[Squirrel Monkey: Ask the zoo keeper (not the red ...,14252,"Not to be missed, ZSL London Zoo is the must- ..."
[https://igx.4sqi.net/img /general/800x600 ...,[Often overlooked for the higher profile Tate ...,17161,
[https://igx.4sqi.net/img /general/800x600/ZY2Q ...,"[WiFi access: 1) open browser, select ...",1893,Welcome to The Cavendish London! Thanks for ...
[https://igx.4sqi.net/img /general/800x600/1290 ...,[One of the best hotels in Knightsbridge! it' ...,11020,Situated in the heart of one of London’s most ...
[https://igx.4sqi.net/img /general/800x600/3850 ...,[Wonderful beds! Walk to Amaya for the best In ...,596,A luxury boutique hotel in London’s Belgravia ...

venue_id
4a5f9446f964a520e0bf1fe3
4abcec53f964a520b98720e3
4abe4502f964a520558c20e3
4ac3ba25f964a520919c20e3
4ac51183f964a52046a020e3
4ac51183f964a52048a020e3
4ac51183f964a52049a020e3
4ac518b4f964a52067a020e3
4ac518b4f964a52071a020e3
4ac518b4f964a52073a020e3


### 2. preprocessing
* put all tips for one venue together, store them in a new column 'all_tips'

In [56]:
# put all tips together
def put_all_tips(tips):
    all_tips = ''
    for t in tips:
        all_tips +=t
    return all_tips
sf['all_tips'] = sf['tips'].apply(lambda x:put_all_tips(x))

   * Text cleaning

In [193]:
# get words, dropping punctuations etc.
sf['all_tips'] = sf['all_tips'].apply(lambda x: re.sub("[^a-zA-Z]", " ", x))

* Tokenization
* Bag-of-words representation
* Stop words and less frequent words removal

In [206]:
# tokenization
docs = gl.text_analytics.tokenize(sf['all_tips'])
# Bag-of-words
docs = gl.text_analytics.count_words(docs)
# Remove stop words
docs = docs.dict_trim_by_keys(gl.text_analytics.stopwords(), exclude=True)
# Remove less freq words
docs = docs.dict_trim_by_values(2)

###  3. Gnerate topics
* remove docs which has less than 3 keywords
* create a model
* check and evaluate

In [207]:
# remove docs which has less than 3 keywords
ix = docs.apply(lambda x:len(x.keys())>3)
docs_new = docs[ix]

In [209]:
# Show how many docs have been removed
print 1.0*len(docs_new)/len(docs)

0.320307281229


In [236]:
# create a topic model 
topic_model = gl.topic_model.create(docs_new,num_topics=30, num_iterations=200)

In [263]:
for i in range(30):
    print 'topic ',i,topic_model.get_topics(num_words=6,output_type='topic_words')[i]['words']

topic  0 ['coffee', 'great', 'staff', 'place', 'friendly', 'flat']
topic  1 ['food', 'market', 'good', 'place', 'cheese', 'lunch']
topic  2 ['museum', 'great', 'art', 'exhibition', 'collection', 'cafe']
topic  3 ['pub', 'beer', 'good', 'great', 'selection', 'beers']
topic  4 ['tea', 'good', 'london', 'service', 'afternoon', 'amazing']
topic  5 ['palace', 'queen', 'buckingham', 'royal', 'changing', 'british']
topic  6 ['food', 'great', 'service', 'wine', 'restaurant', 'menu']
topic  7 ['hotel', 'rooms', 'room', 'location', 'breakfast', 'staff']
topic  8 ['food', 'breakfast', 'good', 'great', 'eggs', 'place']
topic  9 ['chocolate', 'cake', 'delicious', 'amazing', 'london', 'cream']
topic  10 ['food', 'good', 'chicken', 'great', 'lunch', 'delicious']
topic  11 ['park', 'place', 'london', 'beautiful', 'walk', 'day']
topic  12 ['great', 'place', 'london', 'nice', 'gym', 'pool']
topic  13 ['london', 'big', 'nice', 'people', 'beautiful', 'walk']
topic  14 ['place', 'store', 'shop', 'street', 

In [255]:
# These are the 5th topic words
print topic_model.get_topics(num_words=10,output_type='topic_words')[5]

{'words': ['palace', 'queen', 'buckingham', 'royal', 'changing', 'british', 'guards', 'prince', 'reception', 'garden']}


In [256]:
docs_in_topic_5 = docs_new[topic_model.predict(docs_new)==5]

In [259]:
# select venues which are predicted to be topic 5
sf_new = sf[ix]
venue_in_topic5 = sf_new[topic_model.predict(docs_new)==5]

In [260]:
venue_in_topic5['category','name','tips']

category,name,tips
Palace,Buckingham Palace,[Skip the irrelevant change of the guards and ...
Plaza,Horse Guards Parade,[Along the middle path from the main archway it ...
Plaza,Speakers' Corner,[was great on Sunday listening to everyone ...
Landmark,Admiralty Arch,[John Prescott used to have a private flat h ...
Historic Site,Banqueting House,"[This was the only part of Whitehall Palace, one ..."
Art Gallery,Queen's House,"[Completed in 1638 by Inigo Jones, the house ..."
Hospital,Royal Hospital Chelsea,[The statue of Charles II in the square of the ...
Plaza,Fitzroy Square,[During the day in summer the 'private garden' in ...
Palace,Clarence House,[This royal residence was commissioned by the ...
Museum,National Army Museum,"[Amongst the uniforms, weapons and paintings on ..."
