# Case Study 2 : Data Science in Yelp Data

**Required Readings:** 
* [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge) 
* Please download the Yelp dataset from the above webpage.
* [TED Talks](https://www.ted.com/talks) for examples of 10 minutes talks.


** NOTE **
* Please don't forget to save the notebook frequently when working in Jupyter Notebook, otherwise the changes you made can be lost.

*----------------------

Here is an example of the data format. More details are included [here](https://www.yelp.com/dataset_challenge)

## Business Objects

Business objects contain basic information about local businesses. The fields are as follows:

```json
{
  'type': 'business',
  'business_id': (a unique identifier for this business),
  'name': (the full business name),
  'neighborhoods': (a list of neighborhood names, might be empty),
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': (latitude),
  'longitude': (longitude),
  'stars': (star rating, rounded to half-stars),
  'review_count': (review count),
  'photo_url': (photo url),
  'categories': [(localized category names)]
  'open': (is the business still open for business?),
  'schools': (nearby universities),
  'url': (yelp url)
}
```
## Checkin Objects
```json
{
    'type': 'checkin',
    'business_id': (encrypted business id),
    'checkin_info': {
        '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
        '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
        ...
        '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
        ...
        '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
    }, # if there was no checkin for a hour-day block it will not be in the dict
}
```

# Problem: pick a data science problem that you plan to solve using Yelp Data
* The problem should be important and interesting, which has a potential impact in some area.
* The problem should be solvable using yelp data and data science solutions.

Please briefly describe in the following cell: what problem are you trying to solve? why this problem is important and interesting?

Often times when looking through reviews for restaurants in your area, the reviews do not take into account your taste in food. The problem we are trying to solve is how restaurant reviews or suggestions may not take into account your taste in food, or what type of restaurants you like, but rather general popularity to the average consumer.






















# Data Collection/Processing: 

In [2]:
#----------------------------------------------
# Your code starts here
#   Please add comments or text cells in between to explain the general idea of each block of the code.
#   Please feel free to add more cells below this cell if necessary








# Data Exploration: Exploring the Yelp Dataset

**(1) Finding the most popular business categories:** 
* print the top 10 most popular business categories in the dataset and their counts (i.e., how many business objects in each category). Here we say a category is "popular" if there are many business objects in this category (such as 'restaurants').

In [3]:
import json
from collections import Counter

business_categories = []
with open('yelp_academic_dataset_business.json') as fin:
    for line in fin:
        business_categories.append(json.loads(line)['categories'])

catcount = Counter()

for categories in business_categories:
    if categories is not None:
        for category in categories:
            catcount[category] += 1

print catcount.most_common()[:10]

[(u'Restaurants', 48485), (u'Shopping', 22466), (u'Food', 21189), (u'Beauty & Spas', 13711), (u'Home Services', 11241), (u'Nightlife', 10524), (u'Health & Medical', 10476), (u'Bars', 9087), (u'Automotive', 8554), (u'Local Services', 8133)]


** (2) Find the most popular business objects** 
* print the top 10 most popular business objects in the dataset and their counts (i.e., how many checkins in total for each business object).  Here we say a business object is "popular" if the business object attracts a large number of checkins from the users.

In [4]:
import json
import pprint

pp = pprint.PrettyPrinter(indent=3)

business_checkins = []
with open('yelp_academic_dataset_checkin.json') as fin:
    for line in fin:
        business_checkins.append((json.loads(line)['business_id'], list(json.loads(line)['time'])))

top = []
for business in business_checkins:
    count = 0
    for times in business[1]:
        count += int(times.split(':')[1])
    top.append((business[0], count))

top.sort(key=lambda x: x[1], reverse=True)
    
top10 = dict((x, y) for x, y in top[:10])

#print top10

business_objects = []
with open('yelp_academic_dataset_business.json') as fin:
    for line in fin:
        business = json.loads(line)
        if top10.has_key(business['business_id']):
            business_objects.append((business, top10[business['business_id']]))

business_objects.sort(key=lambda x: x[1], reverse=True)            

pp.pprint(business_objects)


[  (  {  u'address': u'5757 Wayne Newton Blvd',
         u'attributes': [u'WiFi: free'],
         u'business_id': u'FaHADZARwnY4yvlvpnsfGA',
         u'categories': [u'Hotels & Travel', u'Airports'],
         u'city': u'Las Vegas',
         u'hours': [  u'Monday 0:0-0:0',
                      u'Tuesday 0:0-0:0',
                      u'Wednesday 0:0-0:0',
                      u'Thursday 0:0-0:0',
                      u'Friday 0:0-0:0',
                      u'Saturday 0:0-0:0',
                      u'Sunday 0:0-0:0'],
         u'is_open': 1,
         u'latitude': 36.0850163303,
         u'longitude': -115.151009469,
         u'name': u'McCarran International Airport',
         u'neighborhood': u'Southeast',
         u'postal_code': u'89119',
         u'review_count': 2865,
         u'stars': 3.5,
         u'state': u'NV',
         u'type': u'business'},
      119204),
   (  {  u'address': u'3400 E Sky Harbor Blvd, Ste 3300',
         u'attributes': [u'WiFi: free'],
         u'busin

# The Solution: implement a data science solution to the problem you are trying to solve.

Briefly describe the idea of your solution to the problem in the following cell:

To solve this problem we decided to take a sample user, in this case a user with a lot of reviews from the Worcester, MA area to get relevant data. We then look at all of the positive reviews (greater than or equal to 4 stars) for that user and all of the business’ that the user has reviewed. We now have access to all of the reviews for all of the business’ that the user has reviewed. Then, in order to find other businesses that the program thinks the user would like, we looked at all the reviews for other businesses that other users who also gave good reviews to the same businesses that our user gave good reviews to. We run an algorithm on that list of reviews, and extract the top reviews based on star rating and number of reviews. We think that this would more accurately show the user what businesses they would like, rather than just picking businesses based on popularity from random reviewers that may have no similar tastes as the user. We think looking at reviews from users with similar tastes would be much more meaningful and relevant for a user seeking out new business’ to try.

Write codes to implement the solution in python:

In [5]:
import json
# please note that some of the cells will take a couple minutes to run because the algorithms we are doing are 
# O(n) time on lists if like 4.4 million reviews

In [6]:
# grab the first reviewer with 15 reviews; we didn't want an absurdly large number of reviews because
# the code would take several minutes to run
with open('yelp_academic_dataset_user.json') as f:
    for line in f:
        user = json.loads(line)
        if user['review_count'] == 15:
            print(user['user_id'])
            break

7FMT6gM7QiYAIv441cRdiA


In [7]:
# filter out all of the businesses that arn't restaurants and hash the business_id to the business_name,
# restaurant_long, and restaurant_lat
restaurants = set()
restaurant_names = {}
restaurant_long = {}
restaurant_lat = {}
with open('yelp_academic_dataset_business.json') as f:
    for line in f:
        business = json.loads(line)
        if business['categories'] and 'Restaurants' in business['categories']:
            restaurants.add(business['business_id'])
            restaurant_names[business['business_id']] = business['name']
            restaurant_long[business['business_id']] = business['longitude']
            restaurant_lat[business['business_id']] = business['latitude']

In [8]:
# grab all of the positive reviews for those businesses
all_reviews = []
with open('yelp_academic_dataset_review.json') as fin:
    for line in fin:
        review = json.loads(line)
        if review['business_id'] in restaurants and review['stars'] >= 4:
            all_reviews.append(review)

In [9]:
# user who happens to be in Pheonix, AZ, who has 15 reiviews
mock_user_id = '7FMT6gM7QiYAIv441cRdiA' 

# grabs all of the reviews for the mock user
user_reviews = []
for review in all_reviews: 
    if review['user_id'] == mock_user_id:
        user_reviews.append(review)
        
for review in user_reviews:
    print(review['text'])

Great Mexican food!  I had the shredded beef burro enchilada style and my husband had the cheese and fajita chicken quesadilla.  Both very tasty!  Very big cold beers too for a very reasonable price.  Nice atmosphere in the bar and close to home.
Durant's is a classy, old-style, American steakhouse.  We have been going there for 10+ years and have only been disappointed once.  We were not served the amazing bread with garlic butter and leeks - just rolls.  The waiter dropped a shrimp from my shrimp cocktail on the table and picked it up and stuck it back with his bare hands.  Then when we received our dinner we noted a rosemary sprig stuck into the mashed potatoes.  That never happened before.  Not a big deal, but we have never been big on foofy presentation and Durant's has never been like that.  Just really excellent service, cocktails and amazingly good food!  We stayed away for a while as it is a pretty good hike from home and didn't want to be disappointed again - after all, it's 

In [10]:
# grabs all the business id's of the businesses that the user reviewed
user_businesses_reviewed = []
for review in user_reviews:
    business_id = review['business_id']
    if business_id not in user_businesses_reviewed:
        user_businesses_reviewed.append(business_id)

In [11]:
# grabs all of the positive reviews for those businesses that the user also gave positive reviews to
positive_reviews = []
for review in all_reviews:
    if review['business_id'] in user_businesses_reviewed:
        positive_reviews.append(review)

In [12]:
# grabs all of the user ID's for those reviews of the businesses that the people who also gave positive reviews to
# the same businesses that the user also gave positive reviews to
positive_users = []
for review in positive_reviews:
    user_id = review['user_id']
    if user_id not in positive_users:
        positive_users.append(user_id)

In [13]:
# clears memory because the memory fills up
user_businesses_reviewed = None
positive_reviews = None
restaurants = None

In [14]:
# grabs all the businesses for those positive reviews for those users who gave positive reviews to those businesses 
# that the user also gave positive reviews to that are not in the list of businesses that the user has already 
# reviewed, and accumulates the star ratings for those businesses; both of these values get put into a tupil
positive_businesses = {}
for review in all_reviews:
    if review['user_id'] in positive_users and review['review_id'] not in user_reviews:
        business_id = review['business_id']
        stars = review['stars']
        if business_id in positive_businesses:
            positive_businesses[business_id] += stars
        else:
            positive_businesses[business_id] = stars

In [1]:
# grabs the top five businesses that have the strongest weighted star ratings
import operator
top5 = sorted(positive_businesses.items(), key=operator.itemgetter(1), reverse=True)[:5]

# grab the longitude and latitude of the five businesses
long_and_lat = [(restaurant_long[r_id], restaurant_lat[r_id]) for r_id, stars in top5]

NameError: name 'positive_businesses' is not defined

# Results: summarize and visualize the results discovered from the analysis

Please use figures, tables, or videos to communicate the results with the audience.


In [16]:
# displays the top 5 restaurants witht their corresponding weighted ratings
print("The format is (name, accumulated star rating):")
print([(restaurant_names[r_id], stars) for r_id, stars in top5])

# display the furthest distance between any two of the restaurants in the user's top five suggested restaurants
import math

max_long = 0
min_long = 0
max_lat = 0
min_lat = 0
max_difference = 0

# find two furthest restaurants
for long1, lat1 in long_and_lat:
    for long2, lat2 in long_and_lat:
        if math.sqrt(math.pow((long2-long1), 2)+math.pow((lat2-lat1), 2))>max_difference:
            max_long=long2
            min_long=long1
            max_lat=lat2
            min_lat=lat1

# convert the distance to miles
C = (math.sin(min_lat/57.3)*math.sin(max_lat/57.3)+math.cos(min_lat/57.3)*
    math.cos(max_lat/57.3)*math.cos(max_long/57.3-min_long/57.3))
Distance = 3959 * math.acos(C)
print("The furthest distance between any of the two restaurants is "+'%.2f' % Distance+" miles.")

The format is (name, accumulated star rating):
[(u"Durant's", 2420), (u'The Wigwam', 530), (u'Cheuvront', 487), (u'Cibo', 325), (u'FEZ', 317)]
The furthest distance between any of the two restaurants is 2.56 miles.


*-----------------
# Done

All set! 

** What do you need to submit?**

* **Notebook File**: Save this Jupyter notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "jupyter notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.

* **PPT Slides**: please prepare PPT slides (for 10 minutes' talk) to present about the case study . Each team present their case studies in class for 10 minutes.

Please compress all the files in a zipped file.


** How to submit: **

        Please submit through Canvas, in the Assignment "Case Study 2".
        
** Note: Each team only needs to submit one submission in Canvas **


# Peer-Review Grading Template:

** Total Points: (100 points) ** Please don't worry about the absolute scores, we will rescale the final grading according to the performance of all teams in the class.

Please add an "**X**" mark in front of your rating: 

For example:

*2: bad*
          
**X** *3: good*
    
*4: perfect*


    ---------------------------------
    The Problem: 
    ---------------------------------
    
    1. (10 points) how well did the team describe the problem they are trying to solve using the data? 
       0: not clear
       2: I can barely understand the problem
       4: okay, can be improved
       6: good, but can be improved
       8: very good
       10: crystal clear
    
    2. (10 points) do you think the problem is important or has a potential impact?
        0: not important at all
        2: not sure if it is important
        4: seems important, but not clear
        6: interesting problem
        8: an important problem, which I want to know the answer myself
       10: very important, I would be happy invest money on a project like this.
    
    ----------------------------------
    Data Collection and Processing:
    ----------------------------------
    
    3. (10 points) Do you think the data collected/processed are relevant and sufficient for solving the above problem? 
       0: not clear
       2: I can barely understand what data they are trying to collect/process
       4: I can barely understand why the data is relevant to the problem
       6: the data are relevant to the problem, but better data can be collected
       8: the data collected are relevant and at a proper scale
      10: the data are properly collected and they are sufficient

    -----------------------------------
    Data Exploration:
    -----------------------------------
    4. How well did the team solve the following task:
    
    (1) Finding the most popular business categories (5 points):
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect
    
    (2) Find the most popular business objects (5 points)
       0: missing answer
       1: okay, but with major problems
       3: good, but with minor problems
       5: perfect
    
    -----------------------------------
    The Solution
    -----------------------------------
    5.  how well did the team describe the solution they used to solve the problem? (10 points)
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
       10: crystal clear
       
    6. how well is the solution in solving the problem? (10 points)
       0: not relevant
       2: barely relevant to the problem
       4: okay solution, but there is an easier solution.
       6: good, but can be improved
       8: very good, but solution is simple/old
       10: innovative and technically sound
       
    7. how well did the team implement the solution in python? (10 points)
       0: the code is not relevant to the solution proposed
       2: the code is barely understandable, but not relevant
       4: okay, the code is clear but incorrect
       6: good, the code is correct, but with major errors
       8: very good, the code is correct, but with minor errors
      10: perfect 
   
    -----------------------------------
    The Results
    -----------------------------------
     8.  How well did the team present the results they found in the data? (10 points)
       0: not clear
       2: I can barely understand
       4: okay, can be improved
       6: good, but can be improved
       8: very good
      10: crystal clear
       
     9.  How do you think of the results they found in the data?  (5 points)
       0: not clear
       1: likely to be wrong
       2: okay, maybe wrong
       3: good, but can be improved
       4: make sense, but not interesting
       5: make sense and very interesting
     
    -----------------------------------
    The Presentation
    -----------------------------------
    10. How all the different parts (data, problem, solution, result) fit together as a coherent story?  
       0: they are irrelevant
       1: I can barely understand how they are related to each other
       2: okay, the problem is good, but the solution doesn't match well, or the problem is not solvable.
       3: good, but the results don't make much sense in the context
       4: very good fit, but not exciting (the storyline can be improved/polished)
       5: a perfect story
      
    11. Did the presenter make good use of the 10 minutes for presentation?  
       0: the team didn't present
       1: bad, barely finished a small part of the talk
       2: okay, barely finished most parts of the talk.
       3: good, finished all parts of the talk, but some part is rushed
       4: very good, but the allocation of time on different parts can be improved.
       5: perfect timing and good use of time      

    12. How well do you think of the presentation (overall quality)?  
       0: the team didn't present
       1: bad
       2: okay
       3: good
       4: very good
       5: perfect


    -----------------------------------
    Overall: 
    -----------------------------------
    13. How many points out of the 100 do you give to this project in total?  Please don't worry about the absolute scores, we will rescale the final grading according to the performance of all teams in the class.
    Total score:
    
    14. What are the strengths of this project? Briefly, list up to 3 strengths.
       1: 
       2:
       3:
    
    15. What are the weaknesses of this project? Briefly, list up to 3 weaknesses.
       1:
       2:
       3:
    
    16. Detailed comments and suggestions. What suggestions do you have for this project to improve its quality further.
    
    
    

    ---------------------------------
    Your Vote: 
    ---------------------------------
    1. [Overall Quality] Between the two submissions that you are reviewing, which team would you vote for a better score?  (5 bonus points)
        0: I vote the other team is better than this team
        5: I vote this team is better than the other team 
        
    2. [Presentation] Among all the teams in the presentation, which team do you think deserves the best presentation award for this case study?  
        1: Team 1
        2: Team 2
        3: Team 3
        4: Team 4
        5: Team 5
        6: Team 6
        7: Team 7
        8: Team 8
        9: Team 9


