# Homework 3 : Predicting Business Cateogries using Checkin Data in Yelp (100 points)

*------------

** NOTE **
* Please don't forget to save the notebook frequently when working in IPython Notebook, otherwise the changes you made can be lost.

*----------------------

* Download the [yelp dataset](https://www.yelp.com/dataset_challenge)
* We need to use the following two files in the dataset:
    * yelp_academic_dataset_business.json
    * yelp_academic_dataset_checkin.json
* Each file in the dataset is composed of one json-object per line. 

## Business Objects

Business objects contain basic information about local businesses. The fields are as follows:

```json
{
  'type': 'business',
  'business_id': (a unique identifier for this business),
  'name': (the full business name),
  'neighborhoods': (a list of neighborhood names, might be empty),
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': (latitude),
  'longitude': (longitude),
  'stars': (star rating, rounded to half-stars),
  'review_count': (review count),
  'photo_url': (photo url),
  'categories': [(localized category names)]
  'open': (is the business still open for business?),
  'schools': (nearby universities),
  'url': (yelp url)
}
```
## Checkin Objects
```json
{
    'type': 'checkin',
    'business_id': (encrypted business id),
    'checkin_info': {
        '0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
        '1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
        ...
        '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
        ...
        '23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
    }, # if there was no checkin for a hour-day block it will not be in the dict
}
```
* The task is to predict the 'business category'  for a given business based upon its checkin_info. 


# Problem 1 (20 points): Load and explore Yelp Data
## Problem 1.1 (5 points): load the checkin objects into a list

In [1]:
import json

#------------------------
filename_checkin = "yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_checkin.json"

list_checkin = []
with open(filename_checkin) as file_checkin:
    for line in file_checkin:
        try:
            list_checkin.append(json.loads(line))
        except:
            print "This line is broken:{:}".format(line)


#list_checkin <-- your list
#------------------------

# print the number of checkin objects
print len(list_checkin)
# print the first checkin object
print json.dumps(list_checkin[0], indent=1)

61049
{
 "checkin_info": {
  "9-5": 1, 
  "7-5": 1, 
  "13-3": 1, 
  "17-6": 1, 
  "13-0": 1, 
  "17-3": 1, 
  "10-0": 1, 
  "18-4": 1, 
  "14-6": 1
 }, 
 "type": "checkin", 
 "business_id": "cE27W9VPgO88Qxe4ol6y_g"
}


## Problem 1.2 (5 points): load the business objects into a list

In [2]:
import json

#------------------------

filename_business = "yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json"

list_business = []
with open(filename_business) as file_business:
    for line in file_business:
        try:
            list_business.append(json.loads(line))
        except:
            print "This line is broken:{:}".format(line)

#------------------------

# print the number of business objects
print len(list_business)
# print the first business object
print json.dumps(list_business[0], indent=1)

85901
{
 "city": "Dravosburg", 
 "review_count": 7, 
 "name": "Mr Hoagie", 
 "neighborhoods": [], 
 "type": "business", 
 "business_id": "5UmKMjUEUNdYWqANhGckJw", 
 "full_address": "4734 Lebanon Church Rd\nDravosburg, PA 15034", 
 "hours": {
  "Tuesday": {
   "close": "21:00", 
   "open": "11:00"
  }, 
  "Friday": {
   "close": "21:00", 
   "open": "11:00"
  }, 
  "Monday": {
   "close": "21:00", 
   "open": "11:00"
  }, 
  "Wednesday": {
   "close": "21:00", 
   "open": "11:00"
  }, 
  "Thursday": {
   "close": "21:00", 
   "open": "11:00"
  }
 }, 
 "state": "PA", 
 "longitude": -79.9007057, 
 "stars": 3.5, 
 "latitude": 40.3543266, 
 "attributes": {
  "Take-out": true, 
  "Drive-Thru": false, 
  "Outdoor Seating": false, 
  "Caters": false, 
  "Noise Level": "average", 
  "Parking": {
   "garage": false, 
   "street": false, 
   "validated": false, 
   "lot": false, 
   "valet": false
  }, 
  "Delivery": false, 
  "Attire": "casual", 
  "Has TV": false, 
  "Price Range": 1, 
  "Good 

## Problem 1.3 (5 points) Finding the most popular business categories
* print the top 10 most popular business categories in the dataset and their counts (i.e., how many business objects in each category). Here we say a category is "popular" = more business objects in this category.


In [3]:
import collections

c = collections.Counter()

for business in list_business:
    c.update(business['categories'])

for cat, count in c.most_common(10):
    print "There are {:5d} businesses in the {:} category".format(count, cat)


There are 26729 businesses in the Restaurants category
There are 12444 businesses in the Shopping category
There are 10143 businesses in the Food category
There are  7490 businesses in the Beauty & Spas category
There are  6106 businesses in the Health & Medical category
There are  5866 businesses in the Home Services category
There are  5507 businesses in the Nightlife category
There are  4888 businesses in the Automotive category
There are  4727 businesses in the Bars category
There are  4041 businesses in the Local Services category


## Problem 1.4 (5 points) Finding the most popular business objects
* print the top 10 most popular business objects in the dataset and their counts (i.e., how many checkins in total for each business object). 
Here we say a business object is "popular" =  more checkins in this business object


In [4]:
import collections

c = collections.Counter()

for checkin in list_checkin:
    c[checkin['business_id']] = sum(checkin['checkin_info'].values())

dict_business = {}
for business in list_business:
    dict_business[business['business_id']] = business

for business_id, count in c.most_common(10):
    business = dict_business[business_id]
    print "{:} has {:5d} checkins".format(business['name'], count)
    # I'm not printing these out because they take up a *lot* of space
    # print json.dumps(business, indent=1)


McCarran International Airport has 85243 checkins
Phoenix Sky Harbor International Airport - PHX has 75705 checkins
Charlotte Douglas International Airport has 33293 checkins
The Cosmopolitan of Las Vegas has 29095 checkins
ARIA Hotel & Casino has 19566 checkins
The Venetian Las Vegas has 19434 checkins
MGM Grand Hotel has 19201 checkins
Caesars Palace Las Vegas Hotel & Casino has 18948 checkins
Bellagio Hotel has 18914 checkins
Kung Fu Tea has 17810 checkins


# Problem 2 (20 points) Pick one of business category from the top 5 most popular categories and prepare the dataset

## Problem 2.1 (10 points) extract the feature vectors for business objects using checkin information.
* Each business object should be represented by a vector of numbers, indicating the number checkins in each time slots.

For example, suppose we have a dataset of three business objects.
* checkin_info for business A:{'0-0': 5, '23-6': 6} 
* checkin_info for business B:{'0-0': 7, '21-6': 9} 
* checkin_info for business C:{'2-2': 8, '23-6': 3}

The feature vectors for the three objects should be 
* A: (5, 0, 0, 6)
* B: (7, 0, 9, 0)
* C: (0, 8, 0, 3)

The values in the feature vector corespond to the numbers of checkins in time slots("0-0", "2-2", "21-6", "23-6") separately

In [5]:
dict_checkin = {}
for checkin in list_checkin:
    dict_checkin[checkin['business_id']] = checkin

dict_business_vector = {}
misses = 0

# Creates a pretty long vector...
categories = [ "{:}-{:}".format(i, j) for i in range(24) for j in range(7)]
def vector_from_checkin(checkin):
    vector = tuple([checkin['checkin_info'].get(category, 0) for category in categories])
    return vector

for business in list_business:
    id = business['business_id']
    if(id in dict_checkin):
        dict_business_vector[id] = vector_from_checkin(dict_checkin[id])
    else:
        dict_business_vector[id] = vector_from_checkin({'checkin_info':{}})
        misses += 1
        

if(misses + len(list_checkin) - len(list_business) !=0):
    print("There's an error somewhere... we didn't calculate the right number of vectors from the businesses and checkins")





list_feature_vectors = [ dict_business_vector[id] for id in sorted(dict_business_vector.keys())]

## Problem 2.2 (10 points) extract the labels (category)
* Choose a category of your choise, for example "Grocery", from **the top 5 popular categories**. Note, the example "Grocery" may not be among the top 5.
* Extract a vector of labels for ** all the business objects in the Problem 2.1**. Note the order of the business objects in Problem 2.1 and 2.2 should be the same.
* The label indicates whether or not the business object belongs to the category that you chose (say "Grocery" for example). If yes, the value is 1, otherwise -1.

For example, suppose we have a dataset of three business objects. Suppose you choose "Grocery" as the target category.
* categories for business A:{'Grocery', 'Food'} 
* categories for business B:{'Food', 'Drink'} 
* categories for business C:{'Grocery'}

The labels for the three objects should be 
* A: 1
* B: -1
* C: 1

In [6]:
target_category = "Restaurants" 

dict_labels = {}
for id in dict_business_vector.keys():
    if target_category in dict_business[id]['categories']:
        dict_labels[id] = 1
    else:
        dict_labels[id] = -1


list_labels = [ dict_labels[id] for id in sorted(dict_labels.keys())]

Now let's split the dataset into a training set and test set for classification task.

In [7]:
from sklearn.cross_validation import train_test_split
X_train,X_test,y_train,y_test = train_test_split(list_feature_vectors,list_labels,test_size=0.2)



# Problem 3 (15 points): Support vector machine (SVM)
* Tune the parameters of SVM on the training data to find the best parameter setting. You could tune the following two parameters in SVM ('C', 'kernel') using the cadidate values in the cell below.
* Apply the best SVM model to the test data to predict the labels.

In [8]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

C_choices=[0.001, 0.01, 0.1, 1, 10, 100, 1000]
kernel_choices=['linear', 'poly', 'rbf', 'sigmoid']

# insert your code here

# Kernel and C (amount we're punishing for an error) are relatively independent, so I can do one after another.

# I'm also using a smaller training set for determining the best values
#    because longer ones take too much time, and doesn't change that much... hopefully

best_kernel = 'bad kernel'
best_score = 0
for kernel in kernel_choices:
    svc = SVC(C=1, kernel=kernel)
    svc.fit(X_train[0:1000], y_train[0:1000])
    score = svc.score(X_test, y_test)
    if score > best_score:
        best_score = score
        best_kernel = kernel
    print "{:} kernel has score {:}".format(kernel, score)

print "The best kernel is {:} with a score of {:}".format(best_kernel, best_score)

best_C = 'bad choice'
best_score = 0
for C in C_choices:
    svc = SVC(C=C, kernel=best_kernel)
    svc.fit(X_train[0:100],y_train[0:100])
    score = svc.score(X_test, y_test)
    if score > best_score:
        best_score = score
        best_C = C
    print "When C={:}, score={:}".format(C,score)

print "The best C is {:} with score {:}".format(best_C, best_score)

best_svm_model = SVC(C=best_C, kernel=best_kernel)# the SVM model using the best parameter setting
print best_svm_model


# insert your code here

# This will take a while...
# Okay... if best_kernel isn't 'linear', then it takes *forever* to train
#    "The fit time complexity is more than quadratic with the number of
#    samples which makes it hard to scale to dataset with more than a
#    couple of 10000 samples." -- http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
# I've found that this starts to take unreasonably long when I go to around 750 values... which is pretty sad (and inaccurate).

svc.fit(X_train[0:500], y_train[0:500])

y_pred_svm = svc.predict(X_test)# the predicted lables

linear kernel has score 0.718060648391
poly kernel has score 0.733368255631
rbf kernel has score 0.736103835632
sigmoid kernel has score 0.608986671323
The best kernel is rbf with a score of 0.736103835632
When C=0.001, score=0.689948198591
When C=0.01, score=0.689948198591
When C=0.1, score=0.689948198591
When C=1, score=0.692974797742
When C=10, score=0.709446481578
When C=100, score=0.660497060707
When C=1000, score=0.660380653047
The best C is 10 with score 0.709446481578
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


Let's evaluate the performance of the prediction

In [9]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# evaluate the results
print classification_report(y_test, y_pred_svm)
print confusion_matrix(y_test, y_pred_svm,labels=(-1,1))

             precision    recall  f1-score   support

         -1       0.73      0.84      0.78     11854
          1       0.46      0.29      0.36      5327

avg / total       0.64      0.67      0.65     17181

[[9978 1876]
 [3757 1570]]


# Problem 4 (45 points): Decision Tree, Bagging, Random Forest
## Problem 4.1 (15 points): Decision Tree
* Tune the parameters of Decision Tree on the training data to find the best parameter setting. You should tune the  parameter *max_depth* using the cadidate values in the cell below.
* Apply the best model to the test data to predict the labels.

In [3]:
from sklearn.tree import DecisionTreeClassifier

max_depth_choices=range(5,25,5)
# insert your code here

best_score = 0
best_tree_model = "Invalid tree model"
for max_depth in max_depth_choices:
    depth_tree = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=max_depth)    
    depth_tree.fit(X_train, y_train)
    score = depth_tree.score(X_test, y_test)
    if(score > best_score):
        best_tree_model = depth_tree
        best_score = score
    print "Score at depth {:}: {:}".format(max_depth, score)

# the model using the best parameter setting
print best_tree_model

y_pred_tree = best_tree_model.predict(X_test)# the predicted labels

NameError: name 'X_train' is not defined

Let's evaluate the performance of the prediction

In [10]:
# evaluate the results
print classification_report(y_test, y_pred_tree)
print confusion_matrix(y_test, y_pred_tree,labels=(-1,1))

             precision    recall  f1-score   support

         -1       0.76      0.93      0.83     11847
          1       0.68      0.35      0.46      5334

avg / total       0.74      0.75      0.72     17181

[[10969   878]
 [ 3459  1875]]


## Problem 4.2 (15 points): Bagging
* Tune the parameters of Bagging method on the training data to find the best parameter setting. You could tune the following parameters: max_depth, n_estimators 
* Apply the best model to the test data to predict the labels.

In [11]:
from sklearn.ensemble import BaggingClassifier

max_depth_choices=range(5,25,5)
n_estimators_choices=range(5, 25, 5) 

# insert your code here

best_score = 0
best_bagging_model = "Invalid bagging model"
for max_depth in max_depth_choices:
    for n_estimators in n_estimators_choices:
        model = BaggingClassifier(n_estimators=n_estimators, max_samples=max_depth)
        model.fit(X_train, y_train)
        score = model.score(X_test, y_test)
        if(score > best_score):
            best_bagging_model = model
            best_score = score
        print "Score with {:} samples and {:} estimators: {:}".format(max_depth, n_estimators, score)

print best_bagging_model # the model using the best parameter setting

y_pred_bagging = best_bagging_model.predict(X_test)# the predicted labels


Score with 5 samples and 5 estimators: 0.661835748792
Score with 5 samples and 10 estimators: 0.685350096036
Score with 5 samples and 15 estimators: 0.690704848379
Score with 5 samples and 20 estimators: 0.689657179442
Score with 10 samples and 5 estimators: 0.682498108376
Score with 10 samples and 10 estimators: 0.705022990513
Score with 10 samples and 15 estimators: 0.699726442
Score with 10 samples and 20 estimators: 0.70281124498
Score with 15 samples and 5 estimators: 0.691461498167
Score with 15 samples and 10 estimators: 0.710668762005
Score with 15 samples and 15 estimators: 0.713695361155
Score with 15 samples and 20 estimators: 0.703567894767
Score with 20 samples and 5 estimators: 0.685699319015
Score with 20 samples and 10 estimators: 0.696408823701
Score with 20 samples and 15 estimators: 0.706128863279
Score with 20 samples and 20 estimators: 0.709155462429
BaggingClassifier(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samp

Let's evaluate the performance of the prediction

In [15]:
# evaluate the results
print classification_report(y_test, y_pred_bagging)
print confusion_matrix(y_test, y_pred_bagging,labels=(-1,1))

             precision    recall  f1-score   support

         -1       0.74      0.90      0.81     11847
          1       0.57      0.31      0.40      5334

avg / total       0.69      0.71      0.68     17181

[[10634  1213]
 [ 3706  1628]]



## Problem 4.3 (15 points): Random Forest
* Tune the parameters of Random Forest method on the training data to find the best parameter setting. You could tune the following parameter: max_depth, n_estimators max_features
* Apply the best model to the test data to predict the labels.


In [13]:
from sklearn.ensemble import RandomForestClassifier

max_depth_choices=range(5,25,5)
n_estimators_choices=range(5, 25, 5) 
max_features_choices=range(1,5)

# insert your code here
best_score = 0
best_rf_model = "Invalid random forest model"
for max_depth in max_depth_choices:
    for n_estimators in n_estimators_choices:
        for max_features in max_features_choices:
            model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features)
            model.fit(X_train, y_train)
            score = model.score(X_test, y_test)
            if(score > best_score):
                best_rf_model = model
                best_score = score
            print "Score with {:} samples and {:} estimators: {:}".format(max_depth, n_estimators, score)





# the model using the best parameter setting
print best_rf_model


# insert your code here



y_pred_rf = best_rf_model.predict(X_test)# the predicted lables




Score with 5 samples and 5 estimators: 0.716372737326
Score with 5 samples and 5 estimators: 0.725452534777
Score with 5 samples and 5 estimators: 0.731854956056
Score with 5 samples and 5 estimators: 0.736336650952
Score with 5 samples and 10 estimators: 0.721552878179
Score with 5 samples and 10 estimators: 0.73383388627
Score with 5 samples and 10 estimators: 0.734765147547
Score with 5 samples and 10 estimators: 0.73686048542
Score with 5 samples and 15 estimators: 0.71212385775
Score with 5 samples and 15 estimators: 0.731272917758
Score with 5 samples and 15 estimators: 0.733251847972
Score with 5 samples and 15 estimators: 0.736103835632
Score with 5 samples and 20 estimators: 0.717827833071
Score with 5 samples and 20 estimators: 0.732669809673
Score with 5 samples and 20 estimators: 0.735521797334
Score with 5 samples and 20 estimators: 0.736278447122
Score with 10 samples and 5 estimators: 0.734183109249
Score with 10 samples and 5 estimators: 0.738606600314
Score with 10 sam

Let's evaluate the performance of the prediction

In [14]:
# evaluate the results
print classification_report(y_test, y_pred_rf)
print confusion_matrix(y_test, y_pred_rf,labels=(-1,1))

             precision    recall  f1-score   support

         -1       0.77      0.95      0.85     11847
          1       0.78      0.36      0.49      5334

avg / total       0.77      0.77      0.74     17181

[[11311   536]
 [ 3420  1914]]


*-----------------
# Done

All set! 

** What do you need to submit?**

* **Notebook File**: Save this IPython notebook, and find the notebook file in your folder (for example, "filename.ipynb"). This is the file you need to submit. Please make sure all the plotted tables and figures are in the notebook. If you used "ipython notebook --pylab=inline" to open the notebook, all the figures and tables should have shown up in the notebook.


** How to submit: **
  Please submit your notebook file through myWPI, in the Assignment "Homework 3".