## Project Title: 
#### Exploring the likelihood and likely amount of tips given to servers in restaurants.


### Abstract:
We are using datasets from the Yelp Dataset Challenge.

Questions we aim to address include:
- How do tips vary by day of the week?
- How do tips from frequent patrons differ?
- What correlations can be found between tips and reviews?
- How does the tip amount from the same patron vary over time? Is there a warm-start bias?
- How are ratings correlated with tips, and how do ratings translate into actual tips?
- Can social behaviour explain these patterns?

How many of these questions we cover depends on what is feasible within the project time frame.

### Data:


For this project, we shall be using publicly-available datasets from the Yelp Dataset Challenge.
Data Source: http://www.yelp.com/dataset_challenge

Yelp connects patrons to the most relevant local businesses based on their preferences and constraints. Yelp's search engine has sifted through over 61 million reviews by patrons, and Yelp has developed this dataset, which is deep, rich, and drawn entirely from real life.
        
The Challenge Dataset:

- 1.6M reviews and 500K tips by 366K users for 61K businesses
- 481K business attributes, e.g., hours, parking availability, ambience
- Social network of 366K users for a total of 2.9M social edges
- Aggregated check-ins over time for each of the 61K businesses


Cities:
- U.K.: Edinburgh
- Germany: Karlsruhe
- Canada: Montreal and Waterloo
- U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison


Data cleaning:

1. Keep only two US cities: Pittsburgh and Las Vegas.
2. Drop columns that are not useful for the analysis.
3. Keep only restaurant data.
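
The three cleaning steps above can be sketched as a single filter over parsed business records. This is a minimal sketch, not the project's final pipeline. It assumes each record is a dict with the `city` and `categories` fields seen in the business file, and that restaurants carry the category label "Restaurants". The helper name `clean_businesses`, the list of kept columns, and the sample records are illustrative assumptions.

```python
# Assumed for illustration: target cities, category label, and columns to keep.
TARGET_CITIES = {"Pittsburgh", "Las Vegas"}
RESTAURANT_CATEGORY = "Restaurants"
KEEP_COLUMNS = ["business_id", "name", "city", "stars", "review_count"]


def clean_businesses(records):
    """Keep restaurants in the target cities, restricted to KEEP_COLUMNS."""
    cleaned = []
    for rec in records:
        if rec.get("city") not in TARGET_CITIES:
            continue  # step 1: only Pittsburgh and Las Vegas
        if RESTAURANT_CATEGORY not in (rec.get("categories") or []):
            continue  # step 3: only restaurants
        # step 2: drop the columns that are not needed
        cleaned.append({col: rec.get(col) for col in KEEP_COLUMNS})
    return cleaned


# Made-up records for demonstration only (not taken from the dataset).
sample = [
    {"business_id": "b1", "name": "Diner A", "city": "Pittsburgh",
     "categories": ["Restaurants", "Diners"], "stars": 4.0,
     "review_count": 12, "state": "PA"},
    {"business_id": "b2", "name": "Clinic B", "city": "Las Vegas",
     "categories": ["Doctors"], "stars": 3.5,
     "review_count": 9, "state": "NV"},
    {"business_id": "b3", "name": "Cafe C", "city": "Phoenix",
     "categories": ["Restaurants"], "stars": 4.5,
     "review_count": 30, "state": "AZ"},
]
print(clean_businesses(sample))
```

Only the Pittsburgh diner survives the filter here: the clinic is not a restaurant and the cafe is outside the two target cities.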


In [2]:
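import pandas as pd
import numpy as np
import matplotlib
%matplotlib nbagg
import csv
import json
from pprint import pprint

# Note: read_table treats the file as tab-delimited text, so each JSON line is
# loaded as one raw string and the first record becomes the column header,
# as the output below shows.
file = 'yelp_academic_dataset_business.json'
ydf = pd.read_table(file)
pprint(ydf)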

      {"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours": {"Tuesday": {"close": "17:00", "open": "08:00"}, "Friday": {"close": "17:00", "open": "08:00"}, "Monday": {"close": "17:00", "open": "08:00"}, "Wednesday": {"close": "17:00", "open": "08:00"}, "Thursday": {"close": "17:00", "open": "08:00"}}, "open": true, "categories": ["Doctors", "Health & Medical"], "city": "Phoenix", "review_count": 9, "name": "Eric Goldberg, MD", "neighborhoods": [], "longitude": -111.98375799999999, "state": "AZ", "stars": 3.5, "latitude": 33.499313000000001, "attributes": {"By Appointment Only": true}, "type": "business"}
0      {"business_id": "UsFtqoBl7naz8AVUBZMjQQ", "ful...                                                                                                                                                                                                                                                                        
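
Because the dataset files hold one JSON object per line, parsing line by line recovers the individual fields instead of one raw string per record. The sketch below uses only the standard library. The helper name `load_json_lines` and the two shortened sample lines are illustrative and not part of the dataset tooling.

```python
import io
import json


def load_json_lines(handle):
    """Parse a file-like object that holds one JSON object per line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Two shortened records in the style of the business file (illustrative only).
sample_file = io.StringIO(
    '{"business_id": "b1", "city": "Phoenix", "stars": 3.5, '
    '"attributes": {"By Appointment Only": true}}\n'
    '{"business_id": "b2", "city": "Las Vegas", "stars": 4.0, '
    '"attributes": {}}\n'
)
businesses = load_json_lines(sample_file)
print(businesses[0]["city"], businesses[0]["attributes"])
```

The same helper accepts a real file handle, for example one obtained with `open('yelp_academic_dataset_business.json')`. For loading straight into a DataFrame, `pd.read_json(path, lines=True)` is an alternative.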



In [None]:
# -*- coding: utf-8 -*-
"""Convert the Yelp Dataset Challenge dataset from json format to csv.
"""
import collections.abc
import csv
import json


def read_and_write_file(json_file_path, csv_file_path, column_names):
    """Read in the json dataset file and write it out to a csv file, given the column names."""
    # Text mode with newline='' is what the Python 3 csv module expects.
    with open(csv_file_path, 'w', newline='', encoding='utf-8') as fout:
        csv_file = csv.writer(fout)
        csv_file.writerow(list(column_names))
        with open(json_file_path, encoding='utf-8') as fin:
            for line in fin:
                line_contents = json.loads(line)
                csv_file.writerow(get_row(line_contents, column_names))

def get_superset_of_column_names_from_file(json_file_path):
    """Read in the json dataset file and return the superset of column names."""
    column_names = set()
    with open(json_file_path, encoding='utf-8') as fin:
        for line in fin:
            line_contents = json.loads(line)
            column_names.update(
                    set(get_column_names(line_contents).keys())
                    )
    return column_names

def get_column_names(line_contents, parent_key=''):
    """Return a dict mapping flattened key names to values, given a nested dict.
    Example:
        line_contents = {
            'a': {
                'b': 2,
                'c': 3,
                },
        }
        will return: {'a.b': 2, 'a.c': 3}
    The keys will be the column names for the eventual csv file.
    """
    column_names = []
    for k, v in line_contents.items():
        column_name = "{0}.{1}".format(parent_key, k) if parent_key else k
        if isinstance(v, collections.abc.MutableMapping):
            column_names.extend(
                    get_column_names(v, column_name).items()
                    )
        else:
            column_names.append((column_name, v))
    return dict(column_names)

def get_nested_value(d, key):
    """Return a dictionary item given a dictionary `d` and a flattened key from `get_column_names`.
    
    Example:
        d = {
            'a': {
                'b': 2,
                'c': 3,
                },
        }
        key = 'a.b'
        will return: 2
    
    """
    # A key can be nested in one record but hold a plain value in another.
    if not isinstance(d, collections.abc.Mapping):
        return None
    if '.' not in key:
        if key not in d:
            return None
        return d[key]
    base_key, sub_key = key.split('.', 1)
    if base_key not in d:
        return None
    sub_dict = d[base_key]
    return get_nested_value(sub_dict, sub_key)

def get_row(line_contents, column_names):
    """Return a csv compatible row given column names and a dict."""
    row = []
    for column_name in column_names:
        line_value = get_nested_value(
                        line_contents,
                        column_name,
                        )
        # Python 3 strings are already unicode, so no manual encoding is needed.
        if line_value is not None:
            row.append('{0}'.format(line_value))
        else:
            row.append('')
    return row

if __name__ == '__main__':
    # Convert a yelp dataset file from json to csv.

    json_file = 'yelp_academic_dataset.json'
    csv_file = '{0}.csv'.format(json_file.split('.json')[0])

    # Sort so that the column order is the same on every run.
    column_names = sorted(get_superset_of_column_names_from_file(json_file))
    read_and_write_file(json_file, csv_file, column_names)

In [3]:
file = 'yelp_academic_dataset_checkin.json'
ydf = pd.read_table(file)
print(ydf.head())

  {"checkin_info": {"9-5": 1, "7-5": 1, "13-3": 1, "17-6": 1, "13-0": 1, "17-3": 1, "10-0": 1, "18-4": 1, "14-6": 1}, "type": "checkin", "business_id": "cE27W9VPgO88Qxe4ol6y_g"}
0  {"checkin_info": {"22-5": 1, "9-5": 1, "15-1":...                                                                                                                             
1  {"checkin_info": {"9-1": 1, "18-3": 1, "15-1":...                                                                                                                             
2  {"checkin_info": {"9-0": 1, "18-1": 1, "19-6":...                                                                                                                             
3  {"checkin_info": {"9-4": 1, "12-2": 1, "7-4": ...                                                                                                                             
4  {"checkin_info": {"9-0": 1, "9-4": 1, "13-2": ...                                                          
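
The `checkin_info` keys look like `hour-day` pairs, which would allow check-ins to be totalled per weekday. The sketch below does that under two assumptions that should be verified against the dataset documentation: that the key format is `"hour-day"`, and that day 0 is Sunday. The helper `checkins_by_day` is illustrative. The input is the `checkin_info` value from the first line of the output above.

```python
from collections import Counter

# Assumption: keys are "hour-day" and day 0 is Sunday.
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday",
             "Thursday", "Friday", "Saturday"]


def checkins_by_day(checkin_info):
    """Sum check-in counts per weekday from a checkin_info dict."""
    totals = Counter()
    for key, count in checkin_info.items():
        _hour, day = key.split("-")
        totals[DAY_NAMES[int(day)]] += count
    return totals


info = {"9-5": 1, "7-5": 1, "13-3": 1, "17-6": 1, "13-0": 1,
        "17-3": 1, "10-0": 1, "18-4": 1, "14-6": 1}
totals = checkins_by_day(info)
for day_name in DAY_NAMES:
    print(day_name, totals[day_name])
```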

In [6]:
file = 'yelp_academic_dataset_review.json'
ydf = pd.read_table(file)
print(ydf.head())

  {"votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17", "text": "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}
0  {"votes": {"funny": 0, "useful": 2, "cool": 0}...                                                                                                                                                                                       
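
One way to approach the rating and tip questions is to aggregate reviews and tips per business and correlate the two series. The sketch below computes the mean review `stars` and the tip count per `business_id` (tip records also carry a `business_id`), then a Pearson correlation. It assumes both files have already been parsed into dicts. The helper names and the sample records are made up for illustration, and Pearson correlation is only one possible measure, not a choice the project has committed to.

```python
from collections import defaultdict
from math import sqrt


def mean_stars_by_business(reviews):
    """Average the review stars for each business_id."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for review in reviews:
        sums[review["business_id"]] += review["stars"]
        counts[review["business_id"]] += 1
    return {b: sums[b] / counts[b] for b in sums}


def tip_counts_by_business(tips):
    """Count the tip records for each business_id."""
    counts = defaultdict(int)
    for tip in tips:
        counts[tip["business_id"]] += 1
    return dict(counts)


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up records for demonstration only.
reviews = [
    {"business_id": "b1", "stars": 5}, {"business_id": "b1", "stars": 4},
    {"business_id": "b2", "stars": 2}, {"business_id": "b3", "stars": 3},
]
tips = ([{"business_id": "b1"}] * 3
        + [{"business_id": "b2"}]
        + [{"business_id": "b3"}] * 2)

stars = mean_stars_by_business(reviews)
tip_counts = tip_counts_by_business(tips)
shared = sorted(set(stars) & set(tip_counts))  # businesses present in both
r = pearson([stars[b] for b in shared], [tip_counts[b] for b in shared])
print(round(r, 3))
```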

In [8]:
file = 'yelp_academic_dataset_tip.json'
ydf = pd.read_table(file)
print(ydf.head())

  {"user_id": "-6rEfobYjMxpUWLNxszaxQ", "text": "Don't waste your time.", "business_id": "cE27W9VPgO88Qxe4ol6y_g", "likes": 0, "date": "2013-04-18", "type": "tip"}
0  {"user_id": "EZ0r9dKKtEGVx2CdnowPCw", "text": ...                                                                                                               
1  {"user_id": "xb6zEQCw9I-Gl0g06e1KsQ", "text": ...                                                                                                               
2  {"user_id": "fvTivrsJoUMYXnOJw9wZfw", "text": ...                                                                                                               
3  {"user_id": "6GrH6gp09pqYykGv86D6Dg", "text": ...                                                                                                               
4  {"user_id": "gl46Pxc4OzLai8JVyxUIwA", "text": ...                                                                                                               
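
Each tip record carries a `date` string in `YYYY-MM-DD` form, so the "tips by day of week" question reduces to parsing the date and counting. A minimal sketch, assuming the tips have already been parsed into dicts: `tips_by_weekday` is an illustrative name, and every sample date except the first (taken from the output above) is made up.

```python
from collections import Counter
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def tips_by_weekday(tips):
    """Count tip records per day of the week, based on the `date` field."""
    counts = Counter()
    for tip in tips:
        day = datetime.strptime(tip["date"], "%Y-%m-%d")
        counts[WEEKDAYS[day.weekday()]] += 1
    return counts


sample_tips = [
    {"date": "2013-04-18"},  # from the first tip shown above
    {"date": "2013-04-19"},  # made up
    {"date": "2013-04-20"},  # made up
    {"date": "2013-04-26"},  # made up
]
weekday_counts = tips_by_weekday(sample_tips)
for name in WEEKDAYS:
    print(name, weekday_counts[name])
```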




In [7]:
file = 'yelp_academic_dataset_user.json'
ydf = pd.read_table(file)
print(ydf.head())

  {"yelping_since": "2004-10", "votes": {"funny": 166, "useful": 278, "cool": 245}, "review_count": 108, "name": "Russel", "user_id": "18kPq7GPye-YQ3LyKyAZPw", "friends": ["rpOyqD_893cqmDAtJLbdog", "4U9kSBLuBDU391x6bxU-YA", "fHtTaujcyKvXglE33Z5yIw", "8J4IIYcqBlFch8T90N923A", "wy6l_zUo7SN0qrvNRWgySw", "HDQixQ-WZEV0LVPJlIGQeQ", "T4kuUr_iJiywOPdyM7gTHQ", "z_5D4XEIlGAPjG3Os9ix5A", "i63u3SdbrLsP4FxiSKP0Zw", "pnrGw4ciBXJ6U5QB2m0F5g", "ytjCBxosVSqCOQ62c4KAxg", "r5uiIxwJ-I-oHBkNY2Ha3Q", "niWoSKswEbooJC_M7HMbGw", "kwoxiKMyoYjB1wTCYAjYRg", "9A8OuP6XwLwnNb9ov3_Ncw", "27MmRg8LfbZXNEHkEnKSdA", "Bn4sJUTtKFZQt0FKHF2Adw", "uguXfIEpI65jSCH5MgUDgA", "6VZNGc2h2Bn-uyuEXgOt5g", "AZ8CTtwr-4sGM2kZqF6qig", "S742m-AuQicMSLDdErrLZQ", "uGmQ6ab4iVpWn5m61VFhkQ", "GJYJX4SujVj3BR8v2F9PDQ", "3shjifK-vZkIHciyy_KbYA", "4lc_H2Cf7CO0tCgyA3aSVQ", "Tunkp_F1R_uFBJQTsDxD4g", "B9pKfr27czBbCoAIircZdQ", "pePGMO6EbDpbaZ7D2m6HIg", "XRM8W6HUoXbrYKR3BCj9Rg", "8DqIWXsKXOipfduYEfFpNw", "dvRVX54Z9f7Om51NsTRX1w", "CM0saLQmk4oAB17UmQTV-
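
The user records include a `review_count`, which could serve as a rough proxy for identifying frequent patrons when looking at tips by user. The sketch below counts tips written by users whose `review_count` meets a cutoff. The cutoff value, the use of `review_count` as the definition of "frequent", the helper name, and the sample records are all assumptions made for illustration; the project has not fixed a definition.

```python
from collections import Counter

FREQUENT_REVIEW_THRESHOLD = 50  # assumed cutoff, for illustration only


def tips_by_frequent_patrons(users, tips, threshold=FREQUENT_REVIEW_THRESHOLD):
    """Count tips per user, restricted to users with review_count >= threshold."""
    frequent = {u["user_id"] for u in users if u["review_count"] >= threshold}
    return Counter(t["user_id"] for t in tips if t["user_id"] in frequent)


# Made-up records for demonstration only.
sample_users = [
    {"user_id": "u1", "review_count": 108},
    {"user_id": "u2", "review_count": 3},
    {"user_id": "u3", "review_count": 60},
]
sample_user_tips = [
    {"user_id": "u1"}, {"user_id": "u1"}, {"user_id": "u2"},
]
patron_tip_counts = tips_by_frequent_patrons(sample_users, sample_user_tips)
print(patron_tip_counts)
```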

## Model Evaluation:
- Logistic Regression
- Naive Bayes Classification
- Classification Tree
- k-Nearest Neighbors
- Support Vector Machines
- Random Forest Classification
- Neural Networks
