@title[Cover Page]
Lyoe Lee - Dec 2017
---
@title[Outline]
@fa[arrow-down](Press down for more...)
+++
@fa[arrow-down]
+++
Anybody who loves to eat (and rate)
---
@title[Summary - Part 1]
- Predict the right restaurants for users |
@fa[arrow-down]
+++
- Scrape top-rated restaurants from Openrice |
@fa[arrow-down]
+++?image=assets/pic/openrice_restaurant_search_first.png
+++?image=assets/pic/openrice_restaurant_search.png
+++
- Scrape top-rated restaurants from Openrice
- Get each review and rating |
@fa[arrow-down]
+++?image=assets/pic/openrice_user_review_first.png
+++?image=assets/pic/openrice_user_review.png
---
@title[Summary - Part 2 (Data)]
- Scrape all of this data down |
- Before showing the data... |
---
@title[Summary - Part 3 (Challenges)]
- Learn [Scrapy](https://scrapy.org/)! |
@fa[arrow-down]
+++
```python
import json
import logging

import scrapy
from scrapy.crawler import CrawlerProcess


class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('reviewresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


class ExtractSpider(scrapy.Spider):
    name = "Extract"
    start_urls = review_urls  # the list of review-page URLs built earlier
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},  # Used for pipeline 1
        'FEED_FORMAT': 'json',           # Used for pipeline 2
        'FEED_URI': 'reviewresult.json'  # Used for pipeline 2
    }

    def parse(self, response):
        for user in response.xpath('//*[@class="sr2-review-list-container full clearfix js-sr2-review-list-container"]'):
            yield {
                # https://stackoverflow.com/questions/20081024/scrapy-get-request-url-in-parse
                'link': response.url,
                'user': user.xpath('div[1]/section/div[1]/a/text()').extract(),
                'user_url': user.xpath('div[1]/section/div[1]/a/@href').extract(),
                'rating': user.xpath('div[2]/section/div[1]/div[1]/div').extract(),
                'taste_star_1': user.xpath('div[3]/section/div[2]/div[2]/span[1]').extract(),
                'taste_star_2': user.xpath('div[3]/section/div[2]/div[2]/span[2]').extract(),
                'taste_star_3': user.xpath('div[3]/section/div[2]/div[2]/span[3]').extract(),
                'taste_star_4': user.xpath('div[3]/section/div[2]/div[2]/span[4]').extract(),
                'taste_star_5': user.xpath('div[3]/section/div[2]/div[2]/span[5]').extract(),
                'decor_star_1': user.xpath('div[3]/section/div[3]/div[2]/span[1]').extract(),
                'decor_star_2': user.xpath('div[3]/section/div[3]/div[2]/span[2]').extract(),
                'decor_star_3': user.xpath('div[3]/section/div[3]/div[2]/span[3]').extract(),
                'decor_star_4': user.xpath('div[3]/section/div[3]/div[2]/span[4]').extract(),
                'decor_star_5': user.xpath('div[3]/section/div[3]/div[2]/span[5]').extract(),
                'service_star_1': user.xpath('div[3]/section/div[4]/div[2]/span[1]').extract(),
                'service_star_2': user.xpath('div[3]/section/div[4]/div[2]/span[2]').extract(),
                'service_star_3': user.xpath('div[3]/section/div[4]/div[2]/span[3]').extract(),
                'service_star_4': user.xpath('div[3]/section/div[4]/div[2]/span[4]').extract(),
                'service_star_5': user.xpath('div[3]/section/div[4]/div[2]/span[5]').extract(),
                'hygiene_star_1': user.xpath('div[3]/section/div[5]/div[2]/span[1]').extract(),
                'hygiene_star_2': user.xpath('div[3]/section/div[5]/div[2]/span[2]').extract(),
                'hygiene_star_3': user.xpath('div[3]/section/div[5]/div[2]/span[3]').extract(),
                'hygiene_star_4': user.xpath('div[3]/section/div[5]/div[2]/span[4]').extract(),
                'hygiene_star_5': user.xpath('div[3]/section/div[5]/div[2]/span[5]').extract(),
                'value_star_1': user.xpath('div[3]/section/div[6]/div[2]/span[1]').extract(),
                'value_star_2': user.xpath('div[3]/section/div[6]/div[2]/span[2]').extract(),
                'value_star_3': user.xpath('div[3]/section/div[6]/div[2]/span[3]').extract(),
                'value_star_4': user.xpath('div[3]/section/div[6]/div[2]/span[4]').extract(),
                'value_star_5': user.xpath('div[3]/section/div[6]/div[2]/span[5]').extract(),
            }


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'
})
process.crawl(ExtractSpider)
process.start()
```
@[1-18](Imports, plus a pipeline class that saves each item as JSON)
@[21-29](Define the extracting spider and the URLs to scrape)
@[31-64](Define the XPath selectors to scrape and store in JSON)
@[67-71](Actual scraping of the websites)
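Running everything through `CrawlerProcess` means the spider can run straight from a script or notebook, without a full Scrapy project.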
+++
- Learn Scrapy!
- Openrice limits... |
- Only up to 17 pages per region! |
- So I did only 3 regions (Hong Kong, Kowloon, New Territories) |
- Bad HTML consistency...
@fa[arrow-down]
+++?image=assets/pic/openrice_17_page_limit.png
+++
- 600+ top-reviewed restaurants |
- 160,000+ individual user reviews |
- 8,000+ separate URLs |
@fa[arrow-down]
+++
- Learn Scrapy!
- Openrice limits...
- Only up to 17 pages per region!
- So I did only 3 regions (Hong Kong, Kowloon, New Territories)
- Bad HTML consistency... |
@fa[arrow-down]
+++?image=assets/pic/openrice_html_inconsistency.png
+++?image=assets/pic/openrice_html_stars.png
---
@title[Summary - Data Tables]
+++?image=assets/pic/restaurant_table.png
+++?image=assets/pic/openrice_restaurant_search.png
+++?image=assets/pic/user_table.png
+++?image=assets/pic/openrice_user_review.png
---
@title[Model Insights: Visualization]
- District |
- Prices |
- Review Count |
- Overall Score |
@fa[arrow-down]
+++?image=assets/pic/restaurant_district.png
+++?image=assets/pic/restarant_price.png
+++?image=assets/pic/restaurant_review_count.png
+++?image=assets/pic/restaurant_review_table.png
+++?image=assets/pic/restaurant_overall_score.png
+++?image=assets/pic/restaurant_score_table.png
+++
#### User reviews
@fa[arrow-down]
+++?image=assets/pic/user_district.png
+++?image=assets/pic/restaurant_district.png
---
@title[Modelling Approach]
#### Recommender system models
+++
If you are browsing a gold T-shirt, recommend other T-shirts or a gold sweater
+++
- User-based |
- Users similar to me also looked at these items |
- Item-based |
- Users who looked at my item also looked at these other items |
+++
- Calculate how "similar" a pair of users/items is |
- Using metrics such as cosine from sklearn to measure "distance/similarity" (sketch on the next slide) @fa[frown-o] |
- Doesn't scale to real-world scenarios |
- Bad at dealing with sparse matrices |
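+++
A minimal sketch of the similarity idea on a made-up toy matrix (not the real 160,000+ reviews):
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-restaurant rating matrix (rows = users, columns = restaurants).
# 0 means "never reviewed" - in the real data almost every cell is 0.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 4, 4],
])

user_sim = cosine_similarity(ratings)    # user-based: compare rows
item_sim = cosine_similarity(ratings.T)  # item-based: compare columns

print(user_sim.round(2))
print(item_sim.round(2))
```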
+++
- Matrix factorization |
- An unsupervised learning method |
- Deals better with scalability and sparsity |
+++?image=assets/pic/sparse_matrix.png
+++
Singular Value Decomposition (SVD)!
Similar to PCA:
(A method for) Dimensionality reduction
SciPy: `scipy.sparse.linalg.svds` (sketch on the next slide)
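+++
A minimal sketch of the svds idea on the same kind of toy matrix; the mean-centring and k=2 here are illustrative choices, not the exact settings used on the Openrice data:
```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy user-restaurant matrix; zeros are restaurants the user never rated.
ratings = np.array([
    [5., 3., 0., 1.],
    [4., 0., 0., 1.],
    [1., 1., 0., 5.],
    [0., 0., 4., 4.],
    [0., 1., 5., 4.],
])

# Centre each user on their own mean so the factors model deviations from it.
user_means = ratings.mean(axis=1, keepdims=True)
centred = ratings - user_means

# Keep k latent factors (k must be smaller than both matrix dimensions).
U, sigma, Vt = svds(centred, k=2)

# Reconstruct the full matrix: the previously-zero cells become predictions.
predicted = U @ np.diag(sigma) @ Vt + user_means
print(predicted.round(2))
```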
---
@title[Model Results]
+++
Take user 'supersupergirl'
+++?image=assets/pic/supersupergirl_prediction_results.png&size=contain
+++?image=assets/pic/mandymanlovefoodie_prediction_results.png&size=contain
---
@title[Conclusion]
Recommendation is HARD
---
@title[Future/Next Steps]
Build 2.0:
Build in penalties for districts/prices/categories (sketch on the next slide)
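+++
Roughly what "penalties" could mean; a hypothetical re-ranking helper with made-up field names and weight, not the project's actual code:
```python
def rerank(candidates, user_profile, penalty=0.5):
    """candidates: list of (restaurant, predicted_score) pairs, where each
    restaurant is a dict with 'district', 'price' and 'category' keys.
    user_profile: dict mapping those same keys to the sets of values the
    user usually reviews."""
    adjusted = []
    for restaurant, score in candidates:
        for field in ('district', 'price', 'category'):
            if restaurant[field] not in user_profile[field]:
                score -= penalty  # penalise mismatches instead of filtering them out
        adjusted.append((restaurant, score))
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)
```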
+++?image=assets/pic/mandymanlovefoodie_prediction_results_2.0.png&size=contain
+++
Play around with hyperparameters, grid search, etc.
A higher 'k' in SVD gives better results for some users (sketch on the next slide)
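+++
A rough sketch of trying different k values; the helper and its train/test inputs are hypothetical, not the notebook's actual code:
```python
import numpy as np
from scipy.sparse.linalg import svds

def rmse_for_k(train, actual, test_mask, k):
    """Factorise the training matrix with k latent factors and score the
    reconstruction on the held-out cells only."""
    U, sigma, Vt = svds(train, k=k)
    predicted = U @ np.diag(sigma) @ Vt
    errors = (predicted - actual)[test_mask]
    return np.sqrt(np.mean(errors ** 2))

# Hypothetical usage: 'train' has the held-out ratings zeroed out,
# 'test_mask' marks those cells, 'actual' holds the true ratings.
# for k in (5, 10, 20, 50, 100):
#     print(k, rmse_for_k(train, actual, test_mask, k))
```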
+++
Final playing around (before I slept...):
For the 50 users who submitted the most reviews, the model's recommendations covered 842 of the 1,141 restaurants in the test set