@title[Cover Page]

Recommendation!

Creating a recommender for Openrice users



Lyoe Lee - Dec 2017


@title[Outline]

Openrice Recommendation

Goal: recommend the right restaurants to users

@fa[arrow-down](Press down for more...)

+++

What to Eat?


Familiar with Openrice?


@fa[arrow-down]

+++

Who?

Anybody who loves to eat (and rate)


@title[Summary - Part 1]

Goals?


- Predict the right restaurants for users |

@fa[arrow-down]

+++

Steps?


- Scrape top-rated restaurants from Openrice |

@fa[arrow-down]

+++?image=assets/pic/openrice_restaurant_search_first.png

+++?image=assets/pic/openrice_restaurant_search.png

+++

Steps?


- Scrape top-rated restaurants from Openrice
- Get all reviews and ratings |

@fa[arrow-down]

+++?image=assets/pic/openrice_user_review_first.png

+++?image=assets/pic/openrice_user_review.png


@title[Summary - Part 2 (Data)]

Data!


- Scrape all this data down |
- Before showing the data... |

@title[Summary - Part 3 (Challenges)]

Challenge 1...!?


- Learn [Scrapy](https://scrapy.org/)! |

@fa[arrow-down]

+++

Scrapy 101: Some Code

```python
import json
import logging

import scrapy
from scrapy.crawler import CrawlerProcess

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('reviewresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

class ExtractSpider(scrapy.Spider):
    name = "Extract"
    start_urls = review_urls  # review-page URLs collected in the restaurant scrape

    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},  # used for pipeline 1
        'FEED_FORMAT': 'json',                                 # used for pipeline 2
        'FEED_URI': 'reviewresult.json'                        # used for pipeline 2
    }

    def parse(self, response):
        for user in response.xpath('//*[@class="sr2-review-list-container full clearfix js-sr2-review-list-container"]'):
            yield {
                # https://stackoverflow.com/questions/20081024/scrapy-get-request-url-in-parse
                'link': response.url,
                'user': user.xpath('div[1]/section/div[1]/a/text()').extract(),
                'user_url': user.xpath('div[1]/section/div[1]/a/@href').extract(),
                'rating': user.xpath('div[2]/section/div[1]/div[1]/div').extract(),
                'taste_star_1': user.xpath('div[3]/section/div[2]/div[2]/span[1]').extract(),
                'taste_star_2': user.xpath('div[3]/section/div[2]/div[2]/span[2]').extract(),
                'taste_star_3': user.xpath('div[3]/section/div[2]/div[2]/span[3]').extract(),
                'taste_star_4': user.xpath('div[3]/section/div[2]/div[2]/span[4]').extract(),
                'taste_star_5': user.xpath('div[3]/section/div[2]/div[2]/span[5]').extract(),
                'decor_star_1': user.xpath('div[3]/section/div[3]/div[2]/span[1]').extract(),
                'decor_star_2': user.xpath('div[3]/section/div[3]/div[2]/span[2]').extract(),
                'decor_star_3': user.xpath('div[3]/section/div[3]/div[2]/span[3]').extract(),
                'decor_star_4': user.xpath('div[3]/section/div[3]/div[2]/span[4]').extract(),
                'decor_star_5': user.xpath('div[3]/section/div[3]/div[2]/span[5]').extract(),
                'service_star_1': user.xpath('div[3]/section/div[4]/div[2]/span[1]').extract(),
                'service_star_2': user.xpath('div[3]/section/div[4]/div[2]/span[2]').extract(),
                'service_star_3': user.xpath('div[3]/section/div[4]/div[2]/span[3]').extract(),
                'service_star_4': user.xpath('div[3]/section/div[4]/div[2]/span[4]').extract(),
                'service_star_5': user.xpath('div[3]/section/div[4]/div[2]/span[5]').extract(),
                'hygiene_star_1': user.xpath('div[3]/section/div[5]/div[2]/span[1]').extract(),
                'hygiene_star_2': user.xpath('div[3]/section/div[5]/div[2]/span[2]').extract(),
                'hygiene_star_3': user.xpath('div[3]/section/div[5]/div[2]/span[3]').extract(),
                'hygiene_star_4': user.xpath('div[3]/section/div[5]/div[2]/span[4]').extract(),
                'hygiene_star_5': user.xpath('div[3]/section/div[5]/div[2]/span[5]').extract(),
                'value_star_1': user.xpath('div[3]/section/div[6]/div[2]/span[1]').extract(),
                'value_star_2': user.xpath('div[3]/section/div[6]/div[2]/span[2]').extract(),
                'value_star_3': user.xpath('div[3]/section/div[6]/div[2]/span[3]').extract(),
                'value_star_4': user.xpath('div[3]/section/div[6]/div[2]/span[4]').extract(),
                'value_star_5': user.xpath('div[3]/section/div[6]/div[2]/span[5]').extract(),
            }

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'
})

process.crawl(ExtractSpider)
process.start()
```

@[1-5](Imports) @[7-18](Pipeline class that saves each item as a line of JSON) @[20-29](Extraction spider: the URLs to scrape plus crawler settings) @[31-64](XPaths pulling the user, overall rating, and per-category star spans into a dict) @[66-71](Actually run the crawl)

+++

Challenge 1...!?


- Learn Scrapy!
- Openrice limits... |
  - Only up to 17 pages per region! |
  - So I only scraped 3 regions (Hong Kong, Kowloon, New Territories) |
- Bad HTML consistency...

@fa[arrow-down]

+++?image=assets/pic/openrice_17_page_limit.png

+++

Challenge 1...!?


In the end...

- 600+ top-reviewed restaurants |
- 160,000+ individual user reviews |
- 8,000+ separate URLs |

@fa[arrow-down]

+++

Challenge 1...!?


- Learn Scrapy!
- Openrice limits...
  - Only up to 17 pages per region!
  - So I only scraped 3 regions (Hong Kong, Kowloon, New Territories)
- Bad HTML consistency... |

@fa[arrow-down]

+++?image=assets/pic/openrice_html_inconsistency.png

+++?image=assets/pic/openrice_html_stars.png


@title[Summary - Data Tables]

DATA

+++?image=assets/pic/restaurant_table.png

+++?image=assets/pic/openrice_restaurant_search.png

+++?image=assets/pic/user_table.png

+++?image=assets/pic/openrice_user_review.png


@title[Model Insights: Visualization]

Exploratory Data Analysis (EDA)


Restaurants overall

- District |
- Prices |
- Review count |
- Overall score |

@fa[arrow-down]
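
+++

A minimal plotting sketch of this EDA step; the filename and `district` column are assumptions, not the project's actual code:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the scraped restaurant table (hypothetical filename/columns)
df = pd.read_json('restaurantresult.json')

# One bar per district, mirroring the charts on the next slides
df['district'].value_counts().head(20).plot(kind='bar')
plt.title('Restaurants per district')
plt.tight_layout()
plt.show()
```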

+++?image=assets/pic/restaurant_district.png

+++?image=assets/pic/restarant_price.png

+++?image=assets/pic/restaurant_review_count.png

+++?image=assets/pic/restaurant_review_table.png

+++?image=assets/pic/restaurant_overall_score.png

+++?image=assets/pic/restaurant_score_table.png


Exploratory Data Analysis (EDA)


#### User reviews
@fa[arrow-down]

+++?image=assets/pic/user_district.png

+++?image=assets/pic/restaurant_district.png


@title[Modelling Approach]

Challenge 2...?!



#### Recommender System Models

+++

Content Based

If you are browsing a gold T-shirt, recommend other T-shirts or a gold sweater

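+++

A minimal sketch of the idea, using TF-IDF over made-up item descriptions (toy data, not the project's):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy item descriptions standing in for product/restaurant attributes
items = ["gold t-shirt", "blue t-shirt", "gold sweater", "black jeans"]

tfidf = TfidfVectorizer().fit_transform(items)
sim = cosine_similarity(tfidf)

# Items most similar to "gold t-shirt", excluding itself
ranked = sim[0].argsort()[::-1][1:]
print([items[i] for i in ranked])
```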

+++

Collaborative Filtering

- User based |
  - Users similar to me also looked at these items |

- Item based |
  - Users who looked at my item also looked at these other items |

+++

Hybrid

+++

Collaborative Filtering...?!?

+++

Memory-based Algo

- Calculate how "similar" a pair of users/items is |
- Use metrics such as cosine similarity from sklearn to measure "distance/similarity" @fa[frown-o] |
- Doesn't scale to real-world scenarios |
- Bad at dealing with sparse matrices |
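
+++

A minimal sketch of memory-based CF on a toy rating matrix (not the project's data):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy user x restaurant rating matrix (0 = not rated)
R = np.array([[5, 4, 0, 1],
              [4, 5, 0, 0],
              [0, 0, 5, 4]])

user_sim = cosine_similarity(R)    # user-based: similarity between users
item_sim = cosine_similarity(R.T)  # item-based: similarity between items

# Predict every rating as a similarity-weighted average over users
pred = user_sim @ R / np.abs(user_sim).sum(axis=1, keepdims=True)
print(pred.round(2))
```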

+++

Model-based Algo

- Matrix factorization |
- An unsupervised learning method |
- Deals better with scalability and sparsity |

+++

Sparsity??


+++?image=assets/pic/sparse_matrix.png
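
+++

Why it happens, as a toy pivot (hypothetical rows shaped like the scraped reviews):

```python
import pandas as pd

reviews = pd.DataFrame({
    'user':       ['a', 'a', 'b', 'c'],
    'restaurant': ['r1', 'r2', 'r1', 'r3'],
    'rating':     [5, 3, 4, 2],
})

# Pivot to a user x restaurant matrix; most cells come out NaN (unrated)
R = reviews.pivot_table(index='user', columns='restaurant', values='rating')
print(R)
print(f"density: {R.notna().mean().mean():.0%}")  # fraction of cells filled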

+++

Model-based Collaborative Filtering

Singular Value Decomposition (SVD)!


Similar to PCA:

(A method for) Dimensionality reduction


SciPy function: scipy.sparse.linalg.svds

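+++

A minimal sketch of the SVD step with scipy.sparse.linalg.svds (toy matrix, not the real ratings):

```python
import numpy as np
from scipy.sparse.linalg import svds

R = np.array([[5., 4., 0., 1.],
              [4., 5., 0., 0.],
              [0., 0., 5., 4.],
              [1., 0., 4., 5.]])

# Demean per user so the factorization models deviations from each user's mean
user_means = R.mean(axis=1, keepdims=True)
U, sigma, Vt = svds(R - user_means, k=2)  # k latent factors, k < min(R.shape)

# Reconstruct a dense prediction matrix and add the means back
preds = U @ np.diag(sigma) @ Vt + user_means
print(preds.round(2))
```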

@title[Model Results]

Model Results:

+++

Predictions/Recommendations:


Take user 'supersupergirl'


+++?image=assets/pic/supersupergirl_prediction_results.png&size=contain

+++?image=assets/pic/mandymanlovefoodie_prediction_results.png&size=contain


@title[Conclusion]

Conclusions:

Recommendation is HARD


@title[Future/Next Steps]

Next Steps:

Build 2.0:

Build in penalties for Districts/Prices/Categories
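
+++

One possible shape for that penalty; the helper name, fields, and weights below are purely illustrative:

```python
def penalized_score(score, restaurant, profile,
                    district_w=0.5, price_w=0.3):
    """Downweight predictions outside a user's usual districts/price bands."""
    if restaurant['district'] not in profile['districts']:
        score -= district_w
    if restaurant['price_band'] not in profile['price_bands']:
        score -= price_w
    return score

profile = {'districts': {'Central'}, 'price_bands': {'$101-200'}}
r = {'district': 'Mong Kok', 'price_band': '$101-200'}
print(penalized_score(4.2, r, profile))  # 3.7: off-district, in-budget
```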

+++?image=assets/pic/mandymanlovefoodie_prediction_results_2.0.png&size=contain

+++

Play around with hyperparameters, grid search etc.

A higher 'k' in SVD gives better results for some users
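
+++

A toy sweep over 'k' (random data, not the project's evaluation):

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(20, 15)).astype(float)  # fake ratings 1-5
mask = rng.random(R.shape) < 0.2                     # hold out 20% of cells

R_train = R.copy()
R_train[mask] = R_train[~mask].mean()  # naive fill for held-out cells

for k in (2, 5, 10):
    U, s, Vt = svds(R_train - R_train.mean(), k=k)
    pred = U @ np.diag(s) @ Vt + R_train.mean()
    rmse = np.sqrt(((pred - R)[mask] ** 2).mean())
    print(f"k={k}: RMSE={rmse:.3f}")
```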

+++

Final playing around (before I slept...):

For the 50 users who submitted the most reviews, the model recommended 842 of the 1,141 restaurants in the test set


Thank you!!