friehl/DS205

YELP DATA COLLECTION SCRIPTS
============================

The goal of these scripts is to take a list of ZIP codes, retrieve all of the restaurants in those ZIPs along with their reviews, and load everything into MongoDB. The scripts use the Yelp API, S3, Scrapy, and MongoDB.

  • load_dataset.py: Uploads the initial Yelp Dataset Challenge data into MongoDB.
  • bin/run.sh: Shell script that kicks off data collection.
  • YelpAPI/yelp_api.py: Reads from the zips.csv file and looks up restaurants in each ZIP code. The script defaults to grabbing 100 results. Results are dumped as JSON files into the api_output directory.
  • YelpAPI/etl.py: Uploads the JSON files in the api_output directory to MongoDB, checking for duplicates before uploading. After uploading, the unique business URLs are written to the url_output directory for review scraping.
  • scrapy crawl yelp_review_skrape: Uses a Scrapy spider to grab all of the reviews, paginating to get every one. The spider lives in yelp_review_skrape/spiders/yelp_spider.py. First time using Scrapy; it's pretty cool.
  • splitter.py: Uploads the scraped reviews to the MongoDB database. A bit of data cleaning was needed.
  • clean_up.py: Takes all of the .json files created by the processes above, uploads them to S3 (the destination bucket is hard-coded in the script), and then moves them into local 'archive' folders in case you want to inspect what you just uploaded.
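The duplicate check in etl.py can be sketched roughly as follows. This is a minimal illustration, assuming each Yelp business record carries an "id" field; the actual collection and field names in etl.py may differ.

```python
# Minimal sketch of the duplicate check described above, assuming each
# Yelp business record has an "id" field (an assumption; the real
# etl.py may key on something else).

def dedupe_businesses(businesses, seen_ids):
    """Return only the businesses whose id has not been seen before,
    updating seen_ids in place."""
    fresh = []
    for biz in businesses:
        if biz["id"] not in seen_ids:
            seen_ids.add(biz["id"])
            fresh.append(biz)
    return fresh

# In the real script, seen_ids would be built from the ids already in
# MongoDB, and `fresh` would then be passed to an insert.
```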

To install all dependencies:

$ pip install -r requirements

You will need to add the following files:

$ touch .boto # add your S3 credentials
$ touch ec2.conf # add your EC2 credentials if you want to connect to a remote db
$ touch yelp.conf # add your Yelp API credentials
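For reference, the .boto file follows the standard Boto config format; the values below are placeholders (the ec2.conf and yelp.conf formats are specific to this repo and not shown here):

```ini
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```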

Then, add the list of ZIP codes you want to query to the zips.csv file.
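The ZIP-code input can be as simple as one code per row. A sketch of how zips.csv might be read (this is a hypothetical helper; the actual parsing in yelp_api.py may differ):

```python
import csv

def read_zips(path):
    """Read ZIP codes from a CSV file, one per row (first column),
    skipping blank rows. Hypothetical helper for illustration."""
    with open(path, newline="") as f:
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]
```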

$ bin/run.sh # This kicks off the API queries and the review scraper

Finally, if you want to dump all of the data from MongoDB, run:

$ bin/get_scraped_data.sh

That's it! This isn't meant to be used extensively, but you can use it to get reviews from specific geographies and run whatever NLP analysis you need.
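As one illustration of downstream analysis (not part of this repo), here is a tiny word-frequency pass over review documents pulled from MongoDB, assuming each review has a "text" field:

```python
# Toy word-frequency count over review documents. The "text" field
# name is an assumption about the review schema; adjust to match
# what splitter.py actually stores.
import re
from collections import Counter

def top_words(reviews, n=5):
    """Return the n most common words across review documents."""
    counts = Counter()
    for review in reviews:
        counts.update(re.findall(r"[a-z']+", review["text"].lower()))
    return counts.most_common(n)
```

In practice, `reviews` would come from a pymongo cursor over the reviews collection.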
