# Setting up Foursquare data for analysis 


> Note: This will be a very open-ended lab, since everyone may end up using different geographies and starting seed locations. Be prepared to walk around and help people individually. I've tested this on several locales around me and it works for most of them, but without a good starting seed location the procedure may not scrape well.

Today's lab will give you hands-on experience with the Foursquare API. We're also going to build a simple crawler/scraper that walks the JSON hierarchy, extracts the data we want, and deposits it into a Pandas table so we can do simple analysis.

In case you're unfamiliar with web scraping, please refer to the Wikipedia page (it's actually pretty good): https://en.wikipedia.org/wiki/Web_scraping. Spend a few moments discussing the concepts and how having this "hackish" skill could help you in the future as a data scientist.

Set up your Foursquare API credentials (client ID and client secret).

In [5]:
# Solutions

import foursquare
import json
import pandas as pd
import unicodedata
import user_keys 

CLIENT_ID = user_keys.CLIENT_ID   # Input your client id/ client secret here
CLIENT_SECRET = user_keys.CLIENT_SECRET

client = foursquare.Foursquare(client_id=CLIENT_ID, client_secret=CLIENT_SECRET)
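
The `user_keys` module imported above is a local file that is not shown in this lab. As a minimal sketch of one alternative, assuming you would rather keep your keys out of the notebook directory, you could read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are hypothetical and are not part of the foursquare library.

```python
import os

def load_credentials(environ=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are illustrative assumptions; use whatever names you like.
    """
    environ = os.environ if environ is None else environ
    client_id = environ.get('FOURSQUARE_CLIENT_ID')
    client_secret = environ.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise ValueError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret
```

With the variables set in your shell, `CLIENT_ID, CLIENT_SECRET = load_credentials()` could stand in for the two `user_keys` lines.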

Get the parameters for a bounding box here: http://boundingbox.klokantech.com/
Choose the CSV output format, then copy and paste the values here.

In [12]:
# Bounding box values for Santa Monica (west, south, east, north)
x = -118.517415   # west longitude
y = 33.993166     # south latitude
z = -118.443426   # east longitude
t = 34.05056      # north latitude

bounding_box = [x, y, z, t]  # Paste the raw CSV values in here
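
Judging from the values above, the CSV string lists the corners as west, south, east, north (longitude, latitude, longitude, latitude). Here is a small sketch, assuming that ordering, that parses a pasted string and derives the south-west corner, north-east corner, and center point as `lat,lng` strings. The helper name `parse_bounding_box` is made up for illustration.

```python
def parse_bounding_box(csv_text):
    """Parse 'west,south,east,north' into corner and center 'lat,lng' strings."""
    west, south, east, north = [float(v) for v in csv_text.split(',')]
    return {
        'sw': '{},{}'.format(south, west),
        'ne': '{},{}'.format(north, east),
        'center': '{:.6f},{:.6f}'.format((south + north) / 2, (west + east) / 2),
    }

box = parse_bounding_box('-118.517415,33.993166,-118.443426,34.05056')
print(box['sw'])
print(box['ne'])
print(box['center'])
```

The `center` string has the same `lat,lng` shape as the `ll` search parameter used in the next cell, so it can serve as a starting seed location.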

In [13]:
# Review the structure of the JSON via the README and API docs
GA_LAT = 34.0108014
GA_LONG = -118.5184891
# Put in a lat/long you're interested in; 'll' expects a "lat,lng" string
data = client.venues.search(params={'ll': '{},{}'.format(GA_LAT, GA_LONG)})

print('Output venue name: ' + data['venues'][0]['name'])
print('Number of venues:  ' + str(len(data['venues'])))

Output venue name: Downtown Santa Monica
Number of venues:  30


Use a method from the foursquare Python library to search for suitable venues around a city near you. Print the associated JSON output in a readable way, with appropriate spacing and indentation.

In [14]:
# Solution

starting_list = client.venues.search(params={'near': 'Santa Monica, CA', 'radius':'1500'})
print(starting_list)


{'confident': False, 'geocode': {'parents': [], 'what': u'', 'where': 'santa monica ca', 'feature': {'highlightedName': '<b>Santa Monica</b>, <b>CA</b>, United States', 'displayName': 'Santa Monica, CA, United States', 'name': 'Santa Monica', 'longId': '72057594043321148', 'cc': 'US', 'id': 'geonameid:5393212', 'geometry': {'center': {'lat': 34.01945, 'lng': -118.49119}, 'bounds': {'sw': {'lat': 33.995416, 'lng': -118.517415}, 'ne': {'lat': 34.05056, 'lng': -118.443517}}}, 'matchedName': 'Santa Monica, CA, United States', 'woeType': 7, 'slug': 'santa-monica-california'}}, 'venues': [{'hasMenu': True, 'verified': True, 'name': 'Jack in the Box', 'referralId': 'v-1469606406', 'venueChains': [{'id': '556e1846a7c82e6b72513d66'}], 'url': 'http://https://www.jackinthebox.com/locations/153', 'menu': {'url': 'https://foursquare.com/v/jack-in-the-box/4b004cebf964a520813c22e3/menu', 'mobileUrl': 'https://foursquare.com/v/4b004cebf964a520813c22e3/device_menu', 'type': 'Menu', 'anchor': 'View Menu

Wow... that should look like a total mess to you. Read the following docs: https://docs.python.org/2/library/json.html, paying attention to the part about pretty printing. Once you think you've understood the method, use it here and see the world of difference a bit of spacing and indenting makes!
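
To try pretty printing without touching the API, here is a minimal sketch on a tiny hand-made dictionary. The `sample` data is invented for illustration and only loosely imitates the shape of a search response.

```python
import json

# Invented sample data, loosely shaped like a venue search response
sample = {'venues': [{'name': 'Example Cafe', 'categories': [{'name': 'Coffee Shop'}]}],
          'confident': False}

# indent adds line breaks and nesting; sort_keys makes the key order predictable
pretty = json.dumps(sample, indent=4, sort_keys=True)
print(pretty)
```

Note that `json.dumps` returns a string, so the original dictionary is left unchanged and can still be traversed as usual.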

In [15]:
print(json.dumps(starting_list, indent = 4))

{
    "confident": false, 
    "geocode": {
        "parents": [], 
        "what": "", 
        "where": "santa monica ca", 
        "feature": {
            "highlightedName": "<b>Santa Monica</b>, <b>CA</b>, United States", 
            "displayName": "Santa Monica, CA, United States", 
            "name": "Santa Monica", 
            "longId": "72057594043321148", 
            "cc": "US", 
            "id": "geonameid:5393212", 
            "geometry": {
                "center": {
                    "lat": 34.01945, 
                    "lng": -118.49119
                }, 
                "bounds": {
                    "sw": {
                        "lat": 33.995416, 
                        "lng": -118.517415
                    }, 
                    "ne": {
                        "lat": 34.05056, 
                        "lng": -118.443517
                    }
                }
            }, 
            "matchedName": "Santa Monica, CA, United States", 
            "wo

Now that we can make some sense of the structure, let's practice traversing the JSON hierarchy. Select one of the venues in the list and output its name.
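
Traversal is just repeated indexing: dictionary keys for objects, integer positions for lists. As a sketch of how to do this defensively, the hypothetical helper `dig` below follows a path of keys and indexes and returns a default when any step is missing. The sample data is invented.

```python
def dig(data, path, default=None):
    """Follow a sequence of dict keys / list indexes; return default if a step is missing."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current

# Invented sample data: the second venue has an empty category list
sample = {'venues': [{'name': 'Example Park', 'categories': [{'name': 'Park'}]},
                     {'name': 'Unnamed Spot', 'categories': []}]}

print(dig(sample, ['venues', 0, 'categories', 0, 'name']))
print(dig(sample, ['venues', 1, 'categories', 0, 'name'], default='No categories'))
```

Plain chained indexing such as `sample['venues'][0]['name']` works just as well when you know every level is present.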

In [16]:
# Solution
type(starting_list['venues'][23]['categories'][0]['name'])

str

Note that the output may not be exactly what we want. Depending on your Python version, the value may display as u'Park', and checking its type may report unicode rather than str. We want plain strings in our table, so we need to convert it. Read the following docs:

https://docs.python.org/2/library/unicodedata.html, and look up the method 'normalize'. Once you think you've understood this method, apply it to the above call and see if you can recover the appropriate type for that data.
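
As a rough sketch of the idea, the hypothetical helper `to_ascii` below decomposes accented characters with `normalize` and then drops whatever cannot be represented in ASCII. The final `.decode('ascii')` is an addition so that the result is a text string on Python 3 as well; the sample strings are invented.

```python
import unicodedata

def to_ascii(text):
    """Decompose accented characters (NFKD), then drop anything outside ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    # encode() returns bytes; decode() turns it back into a text string on Python 3
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii(u'Caf\u00e9 M\u00fcnchen'))
print(to_ascii(u'It\u2019s great \u2764'))
```

Keep in mind that this conversion is lossy: characters with no ASCII equivalent, such as a heart symbol or a curly apostrophe, are silently removed.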


Now for some exploratory analysis: print the total number of venues in your list.

In [17]:
# Solution

len(starting_list['venues'])

30

Extract the id of one venue from your starting list. Make sure it's converted to a plain string rather than Unicode, and store it in a variable called temp. We will use this id to look up that venue's details.

In [50]:
temp = str(starting_list['venues'][1]['id'])
print(temp)

4c47c5d976d72d7f2a673e4d


In [52]:
#new tool that potentially can be used to address unicode
#unicodedata.normalize('NFD', temp).encode('ascii','ignore')

Fetch the details for that venue and print them as nicely formatted JSON.

In [53]:
# Solution

temp1 = client.venues(temp)
print(json.dumps(temp1, indent = 4))

{
    "venue": {
        "reasons": {
            "count": 1, 
            "items": [
                {
                    "reasonName": "rawLikesReason", 
                    "type": "general", 
                    "summary": "Lots of people like this place"
                }
            ]
        }, 
        "likes": {
            "count": 700, 
            "groups": [
                {
                    "count": 700, 
                    "items": [], 
                    "type": "others"
                }
            ], 
            "summary": "700 Likes"
        }, 
        "id": "4c47c5d976d72d7f2a673e4d", 
        "createdAt": 1279772121, 
        "verified": false, 
        "venueRatingBlacklisted": true, 
        "hereNow": {
            "count": 1, 
            "groups": [
                {
                    "count": 1, 
                    "items": [], 
                    "type": "others", 
                    "name": "Other people here"
                }
            ],

Create a procedure that extracts only the comments (tip texts) into a list. There are a few ways you can do this, but I highly recommend you look up the built-in map function: https://docs.python.org/2/tutorial/datastructures.html

This is the same "map" that forms one half of the map-reduce duo used in "Big Data" applications, so it may be helpful to get familiar with it now if that's where you think you may want to take your career.
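
To see what `map` does in isolation, here is a short sketch on invented data shaped like the tips section of a venue response. It also shows the equivalent list comprehension for comparison.

```python
# Invented sample data, shaped like the 'tips' part of a venue detail response
sample_venue = {'venue': {'tips': {'groups': [{'items': [
    {'text': 'Great fries', 'likes': 3},
    {'text': 'Fast drive-thru', 'likes': 1},
]}]}}}

items = sample_venue['venue']['tips']['groups'][0]['items']

# map applies the function to every item; list() materializes the result on Python 3
texts_via_map = list(map(lambda tip: tip['text'], items))

# The equivalent list comprehension, which many Python programmers find easier to read
texts_via_comprehension = [tip['text'] for tip in items]

print(texts_via_map)
```

Either form works here; the choice is mostly a matter of style.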

In [44]:
# Solution
# list() makes the result a list on Python 3, where map returns a lazy iterator
list(map(lambda h: h['text'], temp1['venue']['tips']['groups'][0]['items']))

['By far the most desirable part of LA to live in, Santa Monica boasts shopping, restaurants and of course ...the beach. High rent, limited space and tourists are a compromise for this prime location.',
 'Santa Monica is one of the best coastal cities in LA. Enjoy a walk on the Strand on a nice day and check out the Annenberg Community Beach House which is open to the public!',
 'Santa Monica is the best place ever.',
 u"I'm so happy and grateful to call Santa Monica home!  \u2764",
 'The Best City to Live & Work In, Hands Down!!! <3',
 'Best city to live in, ever!',
 u'It\u2019s a cute little city that ticks all the right boxes: great food, shopping, beaches, entertainment, and culture\u2026and all of it infused with quintessential California cool.',
 'A melhor cidade da California!!!',
 '<3 SaMo <3 Best City I EVER Lived In!',
 'You can always find a great spot to meditate in the morning!',
 "Great workout area just south of the pier. It's on the beach and its free!",
 "It's extremel

Now we're going to bring the above mini-tasks together into one procedure that converts Foursquare JSON data into a rectangular table for further analysis. First, instantiate a pandas data frame.

In [45]:
venue_table = pd.DataFrame()

Write a procedure that takes your list of venues around a chosen location and outputs a table with one row per comment (a venue with multiple comments produces multiple rows). Each row should contain the comment, the venue name, the tip count, the user count, and the store category. Make sure each column holds appropriately typed values: names and categories should be strings, and counts should be numeric.

> To the instructor: I don't usually give students this much latitude, but it was requested that I include some "open ended"/"munch on" problems. I suspect students will spend the most time here; they will certainly hit errors, and they will be frustrated. Look through the ideal solution and be prepared to step in when appropriate.

**Hint**: Before you begin, think about the process. You're going to start with a loop of some kind, then think about the following:
- How many loops do you need?
- Think about the JSON structure: how "deep" do you need to go into the hierarchy to reach the data you want? (This will help you work out how many loops your crawler needs.)
- How should you iteratively add to your Pandas data frame?
- Think of any tests you may need to ensure your procedure does not raise an error. (This may help you figure out how many if statements you need, and where to place them.)
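
One way to think about the inner step is as a function that turns a single venue plus its comments into a list of row dictionaries. The sketch below assumes the field names seen in the JSON above (`categories`, `stats`, `tipCount`, `usersCount`); the helper `venue_rows` and the sample venue are invented for illustration and are not the only way to structure this.

```python
def venue_rows(venue, tips):
    """Return one row (a dict) per tip for a single venue."""
    categories = venue.get('categories') or []
    # Guard against venues that have no category at all
    category = categories[0]['name'] if categories else 'No categories'
    stats = venue.get('stats', {})
    rows = []
    for tip in tips:
        rows.append({
            'name': venue.get('name', ''),
            'tip count': stats.get('tipCount', 0),
            'users count': stats.get('usersCount', 0),
            'store category': category,
            'comments': tip,
        })
    return rows

# Invented sample venue with an empty category list
sample_venue = {'name': 'Example Diner', 'categories': [],
                'stats': {'tipCount': 2, 'usersCount': 40}}
rows = venue_rows(sample_venue, ['Good pie', 'Open late'])
print(len(rows))
print(rows[0]['store category'])
```

A list of dictionaries like this can be passed directly to `pd.DataFrame(rows)`, which is usually simpler than growing a data frame one row at a time.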


In [58]:
# Solution - Note to instructor: student code may differ slightly. In particular, students should have
# written checks for missing/empty values that would otherwise make the procedure fail with an error.

rows = []
# Iterate over every venue (the earlier range(len(...) - 1) skipped the last one)
for venue in starting_list['venues']:
    details = client.venues(venue['id'])
    # Some venues have no tips; fall back to an empty comment list
    tip_groups = details['venue'].get('tips', {}).get('groups', [])
    comment_list = [tip['text'] for tip in tip_groups[0]['items']] if tip_groups else []
    # Use a placeholder when the venue has no categories
    category = venue['categories'][0]['name'] if venue['categories'] else "No categories"

    for comment in comment_list:
        rows.append({"name": venue['name'],
                     "tip count": venue['stats']['tipCount'],
                     "users count": venue['stats']['usersCount'],
                     "store category": category,
                     "comments": comment})

# Build the table in one step so every row gets a unique index
# (DataFrame.append was removed in pandas 2.0)
venue_table = pd.DataFrame(rows, columns=["comments", "name", "store category", "tip count", "users count"])


In [None]:
# Unicode version (written for Python 2, where JSON strings come back as unicode)
# Solution - Note to instructor: student code may differ slightly. In particular, students should have
# written checks for missing/empty values that would otherwise make the procedure fail with an error.

def to_ascii(text):
    # Decompose accented characters, then drop anything that is not plain ASCII.
    # On Python 3 this returns bytes; add .decode('ascii') there to get a str.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')

rows = []
for venue in starting_list['venues']:
    details = client.venues(to_ascii(venue['id']))
    # Some venues have no tips; fall back to an empty comment list
    tip_groups = details['venue'].get('tips', {}).get('groups', [])
    comment_list = [to_ascii(tip['text']) for tip in tip_groups[0]['items']] if tip_groups else []
    # Use a placeholder when the venue has no categories
    category = to_ascii(venue['categories'][0]['name']) if venue['categories'] else "No categories"

    for comment in comment_list:
        rows.append({"name": to_ascii(venue['name']),
                     "tip count": venue['stats']['tipCount'],
                     "users count": venue['stats']['usersCount'],
                     "store category": category,
                     "comments": comment})

# Build the table in one step so every row gets a unique index
venue_table = pd.DataFrame(rows, columns=["comments", "name", "store category", "tip count", "users count"])


Finally, output the Venue table

In [59]:
venue_table.drop_duplicates()

Unnamed: 0,comments,name,store category,tip count,users count
0,Bums love it here! So will you!,Jack in the Box,Burger Joint,12,879
1,Apparently this place doubles as a homeless sh...,Jack in the Box,Burger Joint,12,879
2,The most random shit happens here.,Jack in the Box,Burger Joint,12,879
3,Quick service and friendly customer service,Jack in the Box,Burger Joint,12,879
4,Try the French fries!!!!!!,Jack in the Box,Burger Joint,12,879
5,He who licks dog chocolate is nucking futs.,Jack in the Box,Burger Joint,12,879
6,Sweetest staff & the #1 is delectable :),Jack in the Box,Burger Joint,12,879
7,Attendants at night suck!,Jack in the Box,Burger Joint,12,879
8,Terrible service. I came in and ordered a baco...,Jack in the Box,Burger Joint,12,879
9,Drive thru is always really fast in the morning,Jack in the Box,Burger Joint,12,879


You've done it! You've built a simple crawler that traverses a JSON hierarchy and deposits the results in a Pandas data frame. Congratulations! You're now ready for more data mining in the future, and have just beefed up the **data** part of the data science combination :)