@title[Cover Page]
Lyoe Lee - Dec 2017
---
@title[Outline]
@fa[arrow-down](Press down for more...)
+++
@fa[arrow-down]
+++
Anybody who loves to eat (and rate)
---
@title[Summary - Part 1]
- Predict the right restaurants for users |
@fa[arrow-down]
+++
- Scrape top-rated restaurants from Openrice |
@fa[arrow-down]
+++?image=assets/pic/openrice_restaurant_search_first.png
+++?image=assets/pic/openrice_restaurant_search.png
+++
- Scrape top-rated restaurants from Openrice
- Get each review and rating |
@fa[arrow-down]
+++?image=assets/pic/openrice_user_review_first.png
+++?image=assets/pic/openrice_user_review.png
---
@title[Summary - Part 2 (Data)]
- Scrape all of this data down |
- Before showing the data... |
---
@title[Summary - Part 3 (Challenges)]
- Learn [Scrapy](https://scrapy.org/)! |
@fa[arrow-down]
+++
```python
import json
import logging

import scrapy
from scrapy.crawler import CrawlerProcess


class JsonWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('reviewresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item


class ExtractSpider(scrapy.Spider):
    name = "Extract"
    start_urls = review_urls  # the list of review-page URLs built earlier
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},  # Used for pipeline 1
        'FEED_FORMAT': 'json',           # Used for pipeline 2
        'FEED_URI': 'reviewresult.json'  # Used for pipeline 2
    }

    def parse(self, response):
        for user in response.xpath('//*[@class="sr2-review-list-container full clearfix js-sr2-review-list-container"]'):
            yield {
                # https://stackoverflow.com/questions/20081024/scrapy-get-request-url-in-parse
                'link': response.url,
                'user': user.xpath('div[1]/section/div[1]/a/text()').extract(),
                'user_url': user.xpath('div[1]/section/div[1]/a/@href').extract(),
                'rating': user.xpath('div[2]/section/div[1]/div[1]/div').extract(),
                'taste_star_1': user.xpath('div[3]/section/div[2]/div[2]/span[1]').extract(),
                'taste_star_2': user.xpath('div[3]/section/div[2]/div[2]/span[2]').extract(),
                'taste_star_3': user.xpath('div[3]/section/div[2]/div[2]/span[3]').extract(),
                'taste_star_4': user.xpath('div[3]/section/div[2]/div[2]/span[4]').extract(),
                'taste_star_5': user.xpath('div[3]/section/div[2]/div[2]/span[5]').extract(),
                'decor_star_1': user.xpath('div[3]/section/div[3]/div[2]/span[1]').extract(),
                'decor_star_2': user.xpath('div[3]/section/div[3]/div[2]/span[2]').extract(),
                'decor_star_3': user.xpath('div[3]/section/div[3]/div[2]/span[3]').extract(),
                'decor_star_4': user.xpath('div[3]/section/div[3]/div[2]/span[4]').extract(),
                'decor_star_5': user.xpath('div[3]/section/div[3]/div[2]/span[5]').extract(),
                'service_star_1': user.xpath('div[3]/section/div[4]/div[2]/span[1]').extract(),
                'service_star_2': user.xpath('div[3]/section/div[4]/div[2]/span[2]').extract(),
                'service_star_3': user.xpath('div[3]/section/div[4]/div[2]/span[3]').extract(),
                'service_star_4': user.xpath('div[3]/section/div[4]/div[2]/span[4]').extract(),
                'service_star_5': user.xpath('div[3]/section/div[4]/div[2]/span[5]').extract(),
                'hygiene_star_1': user.xpath('div[3]/section/div[5]/div[2]/span[1]').extract(),
                'hygiene_star_2': user.xpath('div[3]/section/div[5]/div[2]/span[2]').extract(),
                'hygiene_star_3': user.xpath('div[3]/section/div[5]/div[2]/span[3]').extract(),
                'hygiene_star_4': user.xpath('div[3]/section/div[5]/div[2]/span[4]').extract(),
                'hygiene_star_5': user.xpath('div[3]/section/div[5]/div[2]/span[5]').extract(),
                'value_star_1': user.xpath('div[3]/section/div[6]/div[2]/span[1]').extract(),
                'value_star_2': user.xpath('div[3]/section/div[6]/div[2]/span[2]').extract(),
                'value_star_3': user.xpath('div[3]/section/div[6]/div[2]/span[3]').extract(),
                'value_star_4': user.xpath('div[3]/section/div[6]/div[2]/span[4]').extract(),
                'value_star_5': user.xpath('div[3]/section/div[6]/div[2]/span[5]').extract(),
            }


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'
})
process.crawl(ExtractSpider)
process.start()
```
@[1-18](Imports, plus a pipeline class that saves each item as JSON)
@[21-29](Define the extracting spider and the URLs to scrape)
@[31-64](Define the XPath selectors to scrape and store in JSON)
@[67-71](Actual scraping of the websites)
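Running everything through `CrawlerProcess` means the spider can run straight from a script or notebook, without a full Scrapy project.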
+++
- Learn Scrapy!
- Openrice limits... |
- Only up to 17 pages per region! |
- So I did only 3 regions (Hong Kong, Kowloon, New Territories) |
- Bad HTML consistency...
@fa[arrow-down]
+++?image=assets/pic/openrice_17_page_limit.png
+++
- 600+ top-reviewed restaurants |
- 160,000+ individual user reviews |
- 8,000+ separate URLs |
@fa[arrow-down]
+++
- Learn Scrapy!
- Openrice limits...
- Only up to 17 pages per region!
- So I did only 3 regions (Hong Kong, Kowloon, New Territories)
- Bad HTML consistency... |
@fa[arrow-down]
+++?image=assets/pic/openrice_html_inconsistency.png
+++?image=assets/pic/openrice_html_stars.png
---
@title[Summary - Data Tables]
+++?image=assets/pic/restaurant_table.png
+++?image=assets/pic/openrice_restaurant_search.png
+++?image=assets/pic/user_table.png
+++?image=assets/pic/openrice_user_review.png
---
@title[Model Insights: Visualization]
- District |
- Prices |
- Review Count |
- Overall Score |
@fa[arrow-down]
+++?image=assets/pic/restaurant_district.png
+++?image=assets/pic/restarant_price.png
+++?image=assets/pic/restaurant_review_count.png
+++?image=assets/pic/restaurant_review_table.png
+++?image=assets/pic/restaurant_overall_score.png
+++?image=assets/pic/restaurant_score_table.png
+++
#### User reviews
@fa[arrow-down]
+++?image=assets/pic/user_district.png
+++?image=assets/pic/restaurant_district.png
---
@title[Modelling Approach]
#### Recommender system models
+++
If you are browsing a gold T-shirt, recommend other T-shirts or a gold sweater
+++
- User-based |
- Users similar to me also looked at these items |
- Item-based |
- Users who looked at my item also looked at these other items |
+++
- Calculate how "similar" a pair of users/items is |
- Using metrics such as cosine from sklearn to measure "distance/similarity" (sketch on the next slide) @fa[frown-o] |
- Doesn't scale to real-world scenarios |
- Bad at dealing with sparse matrices |
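+++
A minimal sketch of the similarity idea on a made-up toy matrix (not the real 160,000+ reviews):
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-restaurant rating matrix (rows = users, columns = restaurants).
# 0 means "never reviewed" - in the real data almost every cell is 0.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 4, 4],
])

user_sim = cosine_similarity(ratings)    # user-based: compare rows
item_sim = cosine_similarity(ratings.T)  # item-based: compare columns

print(user_sim.round(2))
print(item_sim.round(2))
```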
+++
- Matrix factorization |
- An unsupervised learning method |
- Deals better with scalability and sparsity |
+++?image=assets/pic/sparse_matrix.png
+++
Singular Value Decomposition (SVD)!
Similar to PCA:
(A method for) Dimensionality reduction
SciPy: `scipy.sparse.linalg.svds` (sketch on the next slide)
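+++
A minimal sketch of the svds idea on the same kind of toy matrix; the mean-centring and k=2 here are illustrative choices, not the exact settings used on the Openrice data:
```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy user-restaurant matrix; zeros are restaurants the user never rated.
ratings = np.array([
    [5., 3., 0., 1.],
    [4., 0., 0., 1.],
    [1., 1., 0., 5.],
    [0., 0., 4., 4.],
    [0., 1., 5., 4.],
])

# Centre each user on their own mean so the factors model deviations from it.
user_means = ratings.mean(axis=1, keepdims=True)
centred = ratings - user_means

# Keep k latent factors (k must be smaller than both matrix dimensions).
U, sigma, Vt = svds(centred, k=2)

# Reconstruct the full matrix: the previously-zero cells become predictions.
predicted = U @ np.diag(sigma) @ Vt + user_means
print(predicted.round(2))
```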
---
@title[Model Results]
+++
Take user 'supersupergirl'
+++?image=assets/pic/supersupergirl_prediction_results.png&size=contain
+++?image=assets/pic/mandymanlovefoodie_prediction_results.png&size=contain
---
@title[Conclusion]
Recommendation is HARD
---
@title[Future/Next Steps]
Build 2.0:
Build in penalties for districts/prices/categories (sketch on the next slide)
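+++
Roughly what "penalties" could mean; a hypothetical re-ranking helper with made-up field names and weight, not the project's actual code:
```python
def rerank(candidates, user_profile, penalty=0.5):
    """candidates: list of (restaurant, predicted_score) pairs, where each
    restaurant is a dict with 'district', 'price' and 'category' keys.
    user_profile: dict mapping those same keys to the sets of values the
    user usually reviews."""
    adjusted = []
    for restaurant, score in candidates:
        for field in ('district', 'price', 'category'):
            if restaurant[field] not in user_profile[field]:
                score -= penalty  # penalise mismatches instead of filtering them out
        adjusted.append((restaurant, score))
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)
```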
+++?image=assets/pic/mandymanlovefoodie_prediction_results_2.0.png&size=contain
+++
Play around with hyperparameters, grid search, etc.
A higher 'k' in SVD gives better results for some users (sketch on the next slide)
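+++
A rough sketch of trying different k values; the helper and its train/test inputs are hypothetical, not the notebook's actual code:
```python
import numpy as np
from scipy.sparse.linalg import svds

def rmse_for_k(train, actual, test_mask, k):
    """Factorise the training matrix with k latent factors and score the
    reconstruction on the held-out cells only."""
    U, sigma, Vt = svds(train, k=k)
    predicted = U @ np.diag(sigma) @ Vt
    errors = (predicted - actual)[test_mask]
    return np.sqrt(np.mean(errors ** 2))

# Hypothetical usage: 'train' has the held-out ratings zeroed out,
# 'test_mask' marks those cells, 'actual' holds the true ratings.
# for k in (5, 10, 20, 50, 100):
#     print(k, rmse_for_k(train, actual, test_mask, k))
```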
+++
Final playing around (before I slept...):
For the 50 users who submitted the most reviews, the model's recommendations covered 842 of the 1,141 restaurants in the test set