# Battle of Neighborhoods - How Health Conscious Is Your Neighborhood?
### Applied Data Science Capstone - Coursera/IBM

## Table of Contents
* [Introduction and Problem Statement](#introduction)
* [Data](#data)
    * [Data Extraction and Cleanup](#dataextraction)
    * [Data Summary](#datasummary)

## Introduction and Problem Statement <a name="introduction"></a>

As the world changes around us, one noticeable change is how health-conscious people have become. They are conscious about working out, eating healthy, and staying fit. One facility within a neighborhood that can indicate whether its inhabitants take their health seriously is the gym, specifically the **number of Gyms available AND whether Gyms are among the top 10 recommended venues in that neighborhood.**

We are going to find a statistical method to evaluate every neighborhood within the borough of Manhattan in New York City and use folium maps to visualize the density of venues. Based on the results, we will explore whether we can identify the top 5 health-conscious neighborhoods within the borough.

So, let's find out **How Health Conscious is your neighborhood?**

## Data <a name="data"></a>

The analysis for the problem statement is based on the following factors:

* Number of Gyms and Health Centers in the neighborhood
* Number of Gyms/Fitness Centers appearing in the top 10 recommended venues
* Types of fitness center that appear most often in the recommended list

Data Sources:
* **Google Geocoding API** for obtaining the coordinates of Manhattan. The Google API resolves an address to coordinates.
* **NYU Spatial Data Repository** for listing the neighborhoods in Manhattan. The NYU data contains a list of neighborhoods with coordinates that can be used to find recommended venues around Manhattan.
* **Foursquare API** for the venue details. We use the explore endpoint to retrieve recommended venues around each set of coordinates.

### Data Extraction and Cleanup <a name="dataextraction"></a>

In [1]:
# Declaring the API Keys and parameters
google_api_key=""
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20191201'
LIMIT = 100
radius = 500

In [60]:
# Install and Import required libraries

#!pip install folium
import folium

import requests
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

import numpy as np
import json

#### Importing Neighborhood data from NYU file

In [3]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset

with open('newyork_data.json') as json_data:
    nyData = json.load(json_data)

#### Defining functions that are reused within the analysis

In [13]:
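def get_coordinates(api_key, address, verbose=False):
    """Return [lat, lon] for an address, or [None, None] if the lookup fails."""
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json'
        # pass the query values via params so that the address is URL-encoded
        response = requests.get(url, params={'key': api_key, 'address': address}).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        return [lat, lon]
    except (requests.RequestException, KeyError, IndexError, ValueError):
        # network error, malformed JSON, or no match for the address
        return [None, None]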

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

def getNearbyVenues(names, latitudes, longitudes, categories, radius=500):
    
    venues_list=[]
    cat = categories
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&categoryId={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT,
            cat)
        
        #print("URL: ",url)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
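
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which raises an `IndexError` for a venue that carries no category. A minimal sketch of a more defensive row extractor is shown below. The function name `extract_venue_rows` and the sample payload are hypothetical; the payload only mirrors the nesting that the code above already relies on (`response`, `groups`, `items`, `venue`) and is not real API output.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Turn one explore-style response into flat rows for a single neighborhood."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None instead of failing when a venue has no category
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hypothetical payload shaped like the structure used in getNearbyVenues
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Bikram Yoga",
                   "location": {"lat": 40.876844, "lng": -73.906204},
                   "categories": [{"name": "Yoga Studio"}]}},
        {"venue": {"name": "Unlabeled Venue",
                   "location": {"lat": 40.8770, "lng": -73.9060},
                   "categories": []}},
    ]}]}
}

rows = extract_venue_rows(sample_payload, "Marble Hill", 40.876551, -73.91066)
for row in rows:
    print(row)
```

Rows with a `None` category can then be dropped or kept as unlabeled venues before the data frame is built, depending on how they should count toward the totals.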

#### Defining the city for analysis and applying the functions defined above

We use New York City and find the neighborhoods within the borough of Manhattan. Once we have the neighborhoods, we explore the venues around each location.

First, we need to find the total number of venues available per neighborhood. After that, we find the number of fitness centers/gyms within each neighborhood. For our analysis, we compute the ratio of gyms to all venues in the neighborhood, which helps us understand how gyms are distributed.
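
The ratio of gyms to all venues described above can be sketched in a few lines. The example uses plain Python over a small list of hypothetical records so that it runs without the notebook's data; the keys reuse the column names produced by `getNearbyVenues`, and the three fitness categories are an illustrative subset, not the full list used later.

```python
from collections import Counter

# Hypothetical records; in the notebook these would come from the venues data frame
records = [
    {"Neighborhood": "Marble Hill", "Venue Category": "Pizza Place"},
    {"Neighborhood": "Marble Hill", "Venue Category": "Yoga Studio"},
    {"Neighborhood": "Marble Hill", "Venue Category": "Diner"},
    {"Neighborhood": "Marble Hill", "Venue Category": "Coffee Shop"},
    {"Neighborhood": "Chinatown", "Venue Category": "Gym"},
    {"Neighborhood": "Chinatown", "Venue Category": "Boxing Gym"},
    {"Neighborhood": "Chinatown", "Venue Category": "Bakery"},
    {"Neighborhood": "Chinatown", "Venue Category": "Tea Room"},
]

# Small illustrative subset of fitness categories (an assumption for this sketch)
gym_categories = {"Gym", "Boxing Gym", "Yoga Studio"}

# Count all venues and gym venues per neighborhood
total_venues = Counter(r["Neighborhood"] for r in records)
gym_venues = Counter(r["Neighborhood"] for r in records
                     if r["Venue Category"] in gym_categories)

# Ratio of gyms to all venues, per neighborhood
gym_ratio = {hood: gym_venues[hood] / total for hood, total in total_venues.items()}

for hood, ratio in sorted(gym_ratio.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{hood}: {gym_venues[hood]} of {total_venues[hood]} venues ({ratio:.0%})")
```

Using a ratio rather than a raw count keeps neighborhoods with very different numbers of returned venues comparable.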

#### Refining neighborhood data
We read the JSON file of NY neighborhoods and create a data frame with coordinates. This data frame is the input to the Foursquare API for finding venues.

In [8]:
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect the Manhattan rows in a list first; DataFrame.append was removed in pandas 2.0
rows = []
for data in nyData['features']:

    # GeoJSON coordinates are ordered [longitude, latitude]
    lat = data['geometry']['coordinates'][1]
    lon = data['geometry']['coordinates'][0]

    if data['properties']['borough'] == "Manhattan":
        rows.append({'Borough': data['properties']['borough'],
                     'Neighborhood': data['properties']['name'],
                     'Latitude': lat,
                     'Longitude': lon})

dfNY = pd.DataFrame(rows, columns=column_names)

dfNY.head()

|   | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


In [14]:
address = 'Manhattan, New York, NY'
city_center = get_coordinates(google_api_key, address)
print('Coordinate of {}: {}'.format(address, city_center))
print(address,' longitude={}, latitude={}'.format(city_center[1], city_center[0]))

city_venues = getNearbyVenues(names= dfNY['Neighborhood'],
                              latitudes=dfNY['Latitude'],
                              longitudes=dfNY['Longitude'],
                              categories=''
                              )

#city_venues.head(10)
print("Total Number of venues found {}".format(len(city_venues['Venue'])))

Coordinate of Manhattan, New York, NY: [40.7830603, -73.9712488]
Manhattan, New York, NY  longitude=-73.9712488, latitude=40.7830603
Total Number of venues found 3316


#### Determining the "Gym/Fitness Center" categories from Foursquare data

The Foursquare documentation defines a fixed set of categories. We assign the fitness-related ones to a list and compare the category of each venue returned by the Foursquare API against it. If a venue is categorized as a Gym, we add an identifier of 1 to the data frame; Health Food Stores get an identifier of 2, and all other venues get 0.

With this identifier in place, all of the analysis can be done on a single data frame.

* We map the data onto a folium map to visually understand its distribution across Manhattan.
* We can analyze the ratio of Gyms to all venues by neighborhood.
* We can also understand the popularity of a certain type of Gym in an area.

In [62]:
# define Gym categories based on FS API documentation
gymCat = ['Gym / Fitness Center','Boxing Gym','Climbing Gym','Cycle Studio',
         'Gym Pool','Gymnastics Gym','Gym','Martial Arts Dojo',
         'Outdoor Gym','Pilates Studio','Track','Weight Loss Center',
         'Yoga Studio']

# Updating DF to add Is Gym identifier: 1 = gym/fitness, 2 = health food store, 0 = other
city_venues['Is Gym'] = 0
city_venues.loc[city_venues['Venue Category'].isin(gymCat), 'Is Gym'] = 1
city_venues.loc[city_venues['Venue Category'] == 'Health Food Store', 'Is Gym'] = 2

# printing sample updated data frame
city_venues.head()

|   | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Is Gym |
|---|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Arturo's | 40.874412 | -73.910271 | Pizza Place | 0 |
| 1 | Marble Hill | 40.876551 | -73.91066 | Bikram Yoga | 40.876844 | -73.906204 | Yoga Studio | 1 |
| 2 | Marble Hill | 40.876551 | -73.91066 | Tibbett Diner | 40.880404 | -73.908937 | Diner | 0 |
| 3 | Marble Hill | 40.876551 | -73.91066 | Dunkin' | 40.877136 | -73.906666 | Donut Shop | 0 |
| 4 | Marble Hill | 40.876551 | -73.91066 | Starbucks | 40.877531 | -73.905582 | Coffee Shop | 0 |
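
One of the factors listed earlier is whether Gyms appear among the top 10 recommended venues of a neighborhood. The sketch below reads this as "a gym category is among the ten most common venue categories in the neighborhood", which is an assumption about the criterion rather than something the notebook defines. The function names and the toy records are hypothetical; the keys match the columns of the data frame shown above.

```python
from collections import Counter, defaultdict


def gyms_in_top_categories(records, gym_categories, top_n=10):
    """For each neighborhood, list the gym categories among its top_n most common categories."""
    per_hood = defaultdict(Counter)
    for record in records:
        per_hood[record["Neighborhood"]][record["Venue Category"]] += 1

    result = {}
    for hood, counts in per_hood.items():
        top = [category for category, _ in counts.most_common(top_n)]
        result[hood] = [category for category in top if category in gym_categories]
    return result


def make_records(hood, category_counts):
    """Expand {category: count} into one record per venue (helper for the toy data)."""
    return [{"Neighborhood": hood, "Venue Category": category}
            for category, count in category_counts.items()
            for _ in range(count)]


# Hypothetical venue records, not data from the notebook
records = (
    make_records("Marble Hill", {"Coffee Shop": 3, "Yoga Studio": 2, "Diner": 1})
    + make_records("Chinatown", {"Bakery": 3, "Tea Room": 2, "Gym": 1})
)

# top_n=2 keeps the toy example small; the criterion in the report would use top_n=10
top_gyms = gyms_in_top_categories(records, {"Gym", "Yoga Studio"}, top_n=2)
print(top_gyms)
```

In the toy data, only the first neighborhood has a gym category inside its top two categories, so only it would satisfy this criterion.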


#### Printing a map of Manhattan with Gym/Fitness Centers in red, Health Food Stores in blue, and all other venues in green

In [64]:
mapNY = folium.Map(location=city_center, zoom_start=12)
folium.Marker(city_center, popup='City Center').add_to(mapNY)
for lat, lon, is_gym in zip(city_venues['Venue Latitude'], city_venues['Venue Longitude'], city_venues['Is Gym']):
    color = 'red' if is_gym == 1 else 'blue' if is_gym == 2 else 'green'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(mapNY)
mapNY

### Data Summary <a name="datasummary"></a>

So far we have learned that there are **40 neighborhoods** across Manhattan. The recommended venues of all types cover **2,872** distinct venue names (out of 3,316 venue records in total, because some venue names occur more than once), and **198** of the records are Gym/Fitness Centers of all kinds. On average, there are close to **5** Gym/Fitness Centers per neighborhood.

We will analyze the data further to determine the exact distribution across neighborhoods. **As a working assumption at this point, neighborhoods with more than 4 Gym/Fitness Centers can be considered health-conscious.**

In [65]:
# counts distinct venue names, so a name that occurs several times is counted once
totVenues = city_venues['Venue'].nunique()
totGyms = len(city_venues[city_venues['Is Gym']==1])
totHealthFoods = len(city_venues[city_venues['Is Gym']==2])
totNeighborhoods = len(city_venues['Neighborhood'].unique())

print("Total Neighborhoods across Manhattan {}".format(totNeighborhoods))
print("Total Venues across Manhattan {}".format(totVenues))
print("Total Gyms across Manhattan {}".format(totGyms))
print("Total Health Food Centers across Manhattan {}".format(totHealthFoods))
print("Total Gyms per Neighborhoods across Manhattan {}".format((totGyms/totNeighborhoods)))

Total Neighborhoods across Manhattan 40
Total Venues across Manhattan 2872
Total Gyms across Manhattan 198
Total Health Food Centers across Manhattan 4
Total Gyms per Neighborhoods across Manhattan 4.95
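
The working threshold stated above (more than 4 Gym/Fitness Centers) could be applied per neighborhood as in the sketch below. The neighborhood names and counts are hypothetical placeholders, not results from this notebook; in practice the counts would be derived from the rows where `Is Gym` equals 1.

```python
# Hypothetical gym counts per neighborhood (placeholders, not notebook results)
gym_counts = {
    "Neighborhood A": 7,
    "Neighborhood B": 2,
    "Neighborhood C": 5,
    "Neighborhood D": 4,
    "Neighborhood E": 9,
    "Neighborhood F": 6,
    "Neighborhood G": 8,
}

THRESHOLD = 4  # "more than 4" gyms, as assumed in the summary

# Keep only the neighborhoods strictly above the threshold
health_conscious = {hood: n for hood, n in gym_counts.items() if n > THRESHOLD}

# Rank the remaining neighborhoods by gym count and keep the top 5
top_five = sorted(health_conscious, key=health_conscious.get, reverse=True)[:5]

for rank, hood in enumerate(top_five, start=1):
    print(rank, hood, health_conscious[hood])
```

A neighborhood with exactly 4 gyms is excluded here because the threshold is strict; whether that boundary is appropriate is a judgment call that the later analysis would need to revisit.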


### Next Steps

We will analyze the data to determine which type of fitness center is popular in each neighborhood. If feasible, we will try to expand this analysis to the other boroughs of New York City.
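
One possible starting point for the popularity question is a simple frequency count of fitness categories. The sketch below is only an illustration under stated assumptions: the records are hypothetical, and the category set is a small subset of the gym categories defined earlier.

```python
from collections import Counter

# Hypothetical venue records, not data from the notebook
records = [
    {"Neighborhood": "Marble Hill", "Venue Category": "Yoga Studio"},
    {"Neighborhood": "Marble Hill", "Venue Category": "Gym"},
    {"Neighborhood": "Marble Hill", "Venue Category": "Diner"},
    {"Neighborhood": "Chinatown", "Venue Category": "Yoga Studio"},
    {"Neighborhood": "Chinatown", "Venue Category": "Boxing Gym"},
    {"Neighborhood": "Inwood", "Venue Category": "Yoga Studio"},
    {"Neighborhood": "Inwood", "Venue Category": "Gym"},
    {"Neighborhood": "Inwood", "Venue Category": "Bakery"},
]

# Illustrative subset of the fitness categories
gym_categories = {"Gym", "Boxing Gym", "Yoga Studio"}

# Count how often each fitness category occurs across all neighborhoods
gym_type_counts = Counter(
    r["Venue Category"] for r in records if r["Venue Category"] in gym_categories
)

for category, count in gym_type_counts.most_common():
    print(f"{category}: {count}")
```

The same count grouped by neighborhood would show whether a particular type of fitness center dominates a specific area.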