# Capstone Project Notebook

## This notebook contains the code used to discover the best city in Los Angeles County to open up a fitness store

### Installations:

In [2]:
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    openssl-1.1.1c             |       h516909a_0         2.1 MB  conda-forge
    certifi-2019.6.16          |           py36_1         149 KB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    geopy-1.20.0               |             py_0          57 KB  conda-forge
    ca-certificates-2019.6.16  |       hecc5488_0         145 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.49-py_0         conda-forge
    geopy:           1.20.0-py_0       conda-forge

The following packages will be UPDATED:

    ca-

### Imports:

In [3]:
import pandas as pd
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium # map rendering library
import json # library to handle JSON files
import requests # library to handle requests
import types
import ibm_boto3
from botocore.client import Config
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
from sklearn.cluster import KMeans
from folium.features import FeatureGroup

### Raw CSV Data:

First, I downloaded a CSV file containing information for all of the zip codes in the US from https://simplemaps.com/data/us-zips
Since this file is too large to upload in its entirety to Watson Studio, I kept only the rows with "Los Angeles" in the "county_name" column and saved them as a new CSV file, which I then uploaded as an asset for this notebook.
The next cell is hidden because it contains my credentials. Its purpose is to create the variable "body", which points to the uszips.csv file in my IBM Cloud Object Storage. After that, I load the CSV file into a pandas dataframe.
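
The county filter described above was done outside this notebook. A minimal sketch of how it could be done with pandas is shown here. The inline sample rows and the output filename `uszips_la.csv` are assumptions for illustration; only the `county_name` column name comes from the source file.

```python
import io
import pandas as pd

# Stand-in for the full simplemaps file (a few illustrative rows only).
sample_csv = io.StringIO(
    "zip,city,county_name\n"
    "90001,Los Angeles,Los Angeles\n"
    "92101,San Diego,San Diego\n"
    "90210,Beverly Hills,Los Angeles\n"
)
all_zips = pd.read_csv(sample_csv, dtype={"zip": str})

# Keep only the rows whose county is Los Angeles.
la_zips = all_zips[all_zips["county_name"] == "Los Angeles"].reset_index(drop=True)

# Hypothetical output name; index=False avoids writing an extra index column.
# la_zips.to_csv("uszips_la.csv", index=False)
print(la_zips)
```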

In [4]:
# The code was removed by Watson Studio for sharing.

In [5]:
df = pd.read_csv(body)

#### Familiarize myself with the dataframe.

In [6]:
print("Shape:\n", df.shape)
print("Column Names:\n", df.columns)
print("Distinct Cities:\n", df.city.unique())
df.head()

Shape:
 (290, 16)
Column Names:
 Index(['zip', 'lat', 'lng', 'city', 'state_id', 'state_name', 'zcta',
       'parent_zcta', 'population', 'density', 'county_fips', 'county_name',
       'all_county_weights', 'imprecise', 'military', 'timezone'],
      dtype='object')
Distinct Cities:
 ['Los Angeles' 'West Hollywood' 'Dodgertown' 'Playa Vista' 'Bell Gardens'
 'Beverly Hills' 'Compton' 'Culver City' 'Downey' 'El Segundo' 'Gardena'
 'Hawthorne' 'Hermosa Beach' 'Huntington Park' 'Lawndale' 'Lynwood'
 'Malibu' 'Manhattan Beach' 'Maywood' 'Pacific Palisades'
 'Palos Verdes Peninsula' 'Rancho Palos Verdes' 'Redondo Beach'
 'South Gate' 'Topanga' 'Venice' 'Marina Del Rey' 'Playa Del Rey'
 'Inglewood' 'Santa Monica' 'Torrance' 'Whittier' 'La Mirada' 'Montebello'
 'Norwalk' 'Pico Rivera' 'Santa Fe Springs' 'Artesia' 'Cerritos' 'Avalon'
 'Bellflower' 'Harbor City' 'Lakewood' 'Hawaiian Gardens' 'Lomita'
 'Paramount' 'San Pedro' 'Wilmington' 'Carson' 'Signal Hill' 'Long Beach'
 'Altadena' 'Arcadia

Unnamed: 0,zip,lat,lng,city,state_id,state_name,zcta,parent_zcta,population,density,county_fips,county_name,all_county_weights,imprecise,military,timezone
0,90001,33.97397,-118.24953,Los Angeles,CA,California,True,,57110,6295.8,6037,Los Angeles,{'06037':100},False,False,America/Los_Angeles
1,90002,33.94906,-118.24673,Los Angeles,CA,California,True,,51223,6458.9,6037,Los Angeles,{'06037':100},False,False,America/Los_Angeles
2,90003,33.96411,-118.2737,Los Angeles,CA,California,True,,66266,7204.7,6037,Los Angeles,{'06037':100},False,False,America/Los_Angeles
3,90004,34.07621,-118.31084,Los Angeles,CA,California,True,,62180,7876.3,6037,Los Angeles,{'06037':100},False,False,America/Los_Angeles
4,90005,34.05915,-118.30643,Los Angeles,CA,California,True,,37681,13421.3,6037,Los Angeles,{'06037':100},False,False,America/Los_Angeles


#### Several rows share the same value for "city" and differ only slightly in their other values (one row per zip code). Since I am only interested in cities as a whole, I group the rows by city and use the mean of their latitudes, longitudes, populations, and densities as the new aggregate rows.

In [7]:
aggregation_functions = {'lat': 'mean', 'lng': 'mean', 'population': 'mean', 'density': 'mean'}
df_new = df.groupby('city').aggregate(aggregation_functions)
df_new.reset_index(inplace=True)
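
To illustrate what this aggregation does, here is a small sketch on made-up rows (the numbers are not from the dataset). Each city collapses to a single row holding the mean of its zip-code rows.

```python
import pandas as pd

# Made-up rows: city "A" has two zip codes, city "B" has one.
toy = pd.DataFrame({
    "city": ["A", "A", "B"],
    "lat": [34.0, 34.5, 33.5],
    "lng": [-118.0, -118.5, -118.25],
    "population": [1000, 3000, 500],
    "density": [10.0, 30.0, 5.0],
})

aggregation_functions = {"lat": "mean", "lng": "mean", "population": "mean", "density": "mean"}
toy_new = toy.groupby("city").aggregate(aggregation_functions).reset_index()
print(toy_new)
```

Note that the mean gives a typical per-zip population rather than a city total. If totals were wanted, swapping in `'sum'` for `population` would be one option.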

In [8]:
df_new.head()

Unnamed: 0,city,lat,lng,population,density
0,Acton,34.46511,-118.21416,7993.0,42.9
1,Agoura Hills,34.12274,-118.75727,25488.0,288.6
2,Alhambra,34.08285,-118.136885,41528.5,4128.4
3,Altadena,34.19544,-118.13796,36126.0,1689.5
4,Arcadia,34.13232,-118.03745,32905.0,2153.1


#### Create a map that pinpoints all of the cities in Los Angeles County, centered on the city of Los Angeles.

In [9]:
address = 'Los Angeles, US'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Los Angeles are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Los Angeles are 34.0536909, -118.2427666.


In [10]:
# create map of Los Angeles using latitude and longitude values
map_los_angeles = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, label in zip(df_new['lat'], df_new['lng'], df_new['city']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_los_angeles)  

In [11]:
map_los_angeles

#### Define Foursquare Credentials and Version (Hidden Cell)

In [12]:
# The code was removed by Watson Studio for sharing.

#### Explore Los Angeles City with Foursquare

In [13]:
city_latitude = df_new.loc[55, 'lat'] # city latitude value
city_longitude = df_new.loc[55, 'lng'] # city longitude value

city_name = df_new.loc[55, 'city'] # city name

print('Latitude and longitude values of {} are {}, {}.'.format(city_name, 
                                                               city_latitude, 
                                                               city_longitude))

Latitude and longitude values of Los Angeles are 34.04212092307693, -118.302956.
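
The row label 55 depends on the alphabetical order of the grouped cities, so it would shift if the input rows changed. A sketch of looking the city up by name instead, on made-up rows (the toy coordinates are placeholders, not values from the dataset):

```python
import pandas as pd

toy_cities = pd.DataFrame({
    "city": ["Lomita", "Long Beach", "Los Angeles"],
    "lat": [33.79, 33.80, 34.04],
    "lng": [-118.32, -118.17, -118.30],
})

# Select the row by city name rather than by a hard-coded position.
row = toy_cities.loc[toy_cities["city"] == "Los Angeles"].iloc[0]
city_name, city_latitude, city_longitude = row["city"], row["lat"], row["lng"]
print(city_name, city_latitude, city_longitude)
```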


#### Get the top 100 venues that are in Los Angeles within a radius of 500 meters.

In [14]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    city_latitude, 
    city_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&v=20180605&ll=34.04212092307693,-118.302956&radius=500&limit=100'
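
Displaying the full request URL also prints the API credentials in the cell output. A sketch that builds the same query string with the standard library and shows only a masked copy follows. The helper name `build_explore_url` and the credential values are placeholders I made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the request URL for the venues/explore endpoint."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes values, so the comma in "ll" becomes %2C.
    return BASE_URL + "?" + urlencode(params)

# Placeholder credentials, not real ones.
url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 34.0421, -118.3030, 500, 100)

# Masked copy that is safe to show in shared output.
safe_url = url.replace("MY_ID", "***").replace("MY_SECRET", "***")
print(safe_url)
```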

In [15]:
results = requests.get(url).json()

In [16]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened rows use the prefixed column name instead
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
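
To see what the helper returns for the row shapes it handles, here is a sketch on hand-written rows. The sample dictionaries are made up; real rows come from the flattened Foursquare response. Catching `KeyError` specifically, rather than everything, is my assumption about the intended behavior.

```python
def get_category_type(row):
    """Return the first category name of a venue row, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Three made-up rows: plain key, prefixed key, and an empty category list.
rows = [
    {'categories': [{'name': 'Café'}]},
    {'venue.categories': [{'name': 'Pizza Place'}, {'name': 'Bar'}]},
    {'venue.categories': []},
]
print([get_category_type(r) for r in rows])
```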

In [17]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

In [18]:
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Blu Elefant Café,Café,34.039827,-118.303951
1,Skinny B*tch Pizza,Pizza Place,34.039816,-118.298767
2,Regular Guys Pizza,Pizza Place,34.039885,-118.298287
3,Dollar Tree,Discount Store,34.044013,-118.301424
4,Burger Factory,Burger Joint,34.039826,-118.305305


In [19]:
nearby_venues[(nearby_venues['categories']=='Boxing Gym')]

Unnamed: 0,name,categories,lat,lng


In [20]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

12 venues were returned by Foursquare.


#### Retrieve all venues from each city in Los Angeles County

In [21]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)  # report progress: one request per city
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
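
The list comprehension above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no categories. A sketch of a more defensive row extractor follows. `extract_venue_rows` and the sample items are hypothetical; the item layout mirrors the keys that the function above reads.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn explore-style items into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((
            name, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows

# Made-up items: one with a category, one without.
sample_items = [
    {'venue': {'name': 'Trailhead', 'location': {'lat': 34.12, 'lng': -118.75},
               'categories': [{'name': 'Trail'}]}},
    {'venue': {'name': 'Unlabeled Spot', 'location': {'lat': 34.13, 'lng': -118.76},
               'categories': []}},
]
rows = extract_venue_rows('Agoura Hills', 34.12274, -118.75727, sample_items)
for r in rows:
    print(r)
```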

In [26]:
# Get all the venues for each city
los_angeles_venues = getNearbyVenues(names=df_new['city'],
                                   latitudes=df_new['lat'],
                                   longitudes=df_new['lng']
                                    )

Acton
Agoura Hills
Alhambra
Altadena
Arcadia
Artesia
Avalon
Azusa
Baldwin Park
Bell Gardens
Bellflower
Beverly Hills
Burbank
Calabasas
Canoga Park
Canyon Country
Carson
Castaic
Cerritos
Chatsworth
Claremont
Compton
Covina
Culver City
Diamond Bar
Dodgertown
Downey
Duarte
El Monte
El Segundo
Encino
Gardena
Glendale
Glendora
Granada Hills
Hacienda Heights
Harbor City
Hawaiian Gardens
Hawthorne
Hermosa Beach
Huntington Park
Inglewood
La Canada Flintridge
La Crescenta
La Mirada
La Puente
La Verne
Lake Hughes
Lakewood
Lancaster
Lawndale
Littlerock
Llano
Lomita
Long Beach
Los Angeles
Lynwood
Malibu
Manhattan Beach
Marina Del Rey
Maywood
Mission Hills
Monrovia
Montebello
Monterey Park
Montrose
Newhall
North Hills
North Hollywood
Northridge
Norwalk
Pacific Palisades
Pacoima
Palmdale
Palos Verdes Peninsula
Panorama City
Paramount
Pasadena
Pearblossom
Pico Rivera
Playa Del Rey
Playa Vista
Pomona
Porter Ranch
Rancho Palos Verdes
Redondo Beach
Reseda
Rosemead
Rowland Heights
San Dimas
San Fernando


In [27]:
print(los_angeles_venues.shape)
los_angeles_venues.head()

(1522, 7)


Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Agoura Hills,34.12274,-118.75727,Final Construction Clean Up,34.123121,-118.757486,Construction & Landscaping
1,Agoura Hills,34.12274,-118.75727,RG Electric Services - Agoura Hills Electrical...,34.123963,-118.757025,Electronics Store
2,Agoura Hills,34.12274,-118.75727,Bwana Trail,34.121424,-118.753927,Trail
3,Agoura Hills,34.12274,-118.75727,Dead Man Overlook,34.123557,-118.76103,Scenic Lookout
4,Agoura Hills,34.12274,-118.75727,Medicine Woman Trail,34.118563,-118.757534,Trail


#### Clean up the dataframe by one-hot encoding the venues for each city according to their venue category

In [28]:
# one hot encoding
los_angeles_onehot = pd.get_dummies(los_angeles_venues[['Venue Category']], prefix="", prefix_sep="")

# add city column back to dataframe
los_angeles_onehot['City'] = los_angeles_venues['City'] 

# move city column to the first column
fixed_columns = [los_angeles_onehot.columns[-1]] + list(los_angeles_onehot.columns[:-1])
los_angeles_onehot = los_angeles_onehot[fixed_columns]

los_angeles_onehot.head()

Unnamed: 0,City,ATM,Adult Boutique,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Arts & Entertainment,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Waterfront,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,Agoura Hills,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Agoura Hills,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Agoura Hills,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Agoura Hills,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Agoura Hills,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
los_angeles_onehot.shape

(1522, 248)
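
As a small illustration of the encoding step, here is a sketch on made-up venues. It uses `DataFrame.insert` to place the city column first, which is an alternative I chose for brevity and not the approach used in the cell above.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    "City": ["A", "A", "B"],
    "Venue Category": ["Gym", "Park", "Gym"],
})

# One column per category; each row has a single 1 marking its category.
toy_onehot = pd.get_dummies(toy_venues[["Venue Category"]], prefix="", prefix_sep="")

# Put the city back as the first column.
toy_onehot.insert(0, "City", toy_venues["City"])
print(toy_onehot)
```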

#### Next, group rows by city and use the mean of each category as the new venue values

In [30]:
los_angeles_grouped = los_angeles_onehot.groupby('City').mean().reset_index()
los_angeles_grouped.head()

Unnamed: 0,City,ATM,Adult Boutique,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Arts & Entertainment,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Water Park,Waterfront,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,Agoura Hills,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Alhambra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Altadena,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Arcadia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Artesia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.066667,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Next, since I am only interested in gym and park venues, I combine all gym categories into one column and all outdoor park/trail categories into another

In [31]:
los_angeles_grouped['Gyms'] = los_angeles_grouped['Gym / Fitness Center'] + los_angeles_grouped['Gym'] + los_angeles_grouped['Gymnastics Gym'] + los_angeles_grouped['Boxing Gym']
los_angeles_grouped['Parks'] = los_angeles_grouped['Park'] + los_angeles_grouped['Trail']
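
Adding the columns by name raises a `KeyError` if Foursquare happens to return no venue in one of those categories. A sketch that tolerates missing categories by reindexing before summing, on made-up values (the toy frame deliberately lacks some of the categories):

```python
import pandas as pd

toy_grouped = pd.DataFrame({
    "City": ["A", "B"],
    "Gym": [0.25, 0.0],
    "Gym / Fitness Center": [0.5, 0.0],
    "Park": [0.0, 0.5],
})

gym_columns = ["Gym / Fitness Center", "Gym", "Gymnastics Gym", "Boxing Gym"]
park_columns = ["Park", "Trail"]

# reindex adds any missing category as a column of zeros before the row-wise sum.
toy_grouped["Gyms"] = toy_grouped.reindex(columns=gym_columns, fill_value=0).sum(axis=1)
toy_grouped["Parks"] = toy_grouped.reindex(columns=park_columns, fill_value=0).sum(axis=1)
print(toy_grouped[["City", "Gyms", "Parks"]])
```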

#### Then I sort the cities first by the prevalence of gyms and then by the prevalence of parks

In [32]:
los_angeles_grouped = los_angeles_grouped[['City','Gyms','Parks']].sort_values(['Gyms', 'Parks'], ascending=[False, False])

In [33]:
los_angeles_grouped.head()

Unnamed: 0,City,Gyms,Parks
46,Lynwood,0.25,0.0
93,Tarzana,0.25,0.0
17,Culver City,0.222222,0.0
23,El Segundo,0.2,0.0
24,Encino,0.2,0.0


#### Next, I separate the cities that have nonzero values for gyms or parks from those that have none, which are areas where we shouldn't open up a fitness center
#### We also need the latitude and longitude of each city in order to mark it on the map, so I join these values from the original dataset onto our current 'los_angeles_grouped' dataframe

In [34]:
df_new.rename(columns={'city':'City'},inplace=True)

In [35]:
los_angeles_merged = df_new[['City','lat','lng']]
# join los_angeles_grouped with the lat/lng columns of df_new to add coordinates for each city
los_angeles_merged = los_angeles_grouped.join(los_angeles_merged.set_index('City'), on='City')

In [36]:
los_angeles_none = los_angeles_merged[ (los_angeles_merged['Gyms'] == 0) & (los_angeles_merged['Parks'] == 0) ]

In [37]:
indexNames = los_angeles_merged[ (los_angeles_merged['Gyms'] == 0) & (los_angeles_merged['Parks'] == 0) ].index
los_angeles_merged.drop(indexNames , inplace=True)
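
The two cells above evaluate the same condition twice. A sketch that builds the boolean mask once and uses it for both halves, on made-up rows (the `toy_` names are mine):

```python
import pandas as pd

toy_merged = pd.DataFrame({
    "City": ["A", "B", "C"],
    "Gyms": [0.25, 0.0, 0.0],
    "Parks": [0.0, 0.0, 0.5],
})

# True for cities with neither gyms nor parks among their venues.
no_fitness = (toy_merged["Gyms"] == 0) & (toy_merged["Parks"] == 0)

toy_none = toy_merged[no_fitness]    # cities with no gyms and no parks
toy_kept = toy_merged[~no_fitness]   # cities with at least one of the two
print(toy_none)
print(toy_kept)
```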

In [39]:
los_angeles_merged.head()

Unnamed: 0,City,Gyms,Parks,lat,lng
46,Lynwood,0.25,0.0,33.92365,-118.20053
93,Tarzana,0.25,0.0,34.15508,-118.54751
17,Culver City,0.222222,0.0,34.008005,-118.39321
23,El Segundo,0.2,0.0,33.91695,-118.40206
24,Encino,0.2,0.0,34.15563,-118.50449


#### Now I create the map. I take the dataframe 'los_angeles_merged', which contains all the cities with nonzero values in at least one of the 'Gyms' or 'Parks' columns, and divide it into 3 sections: Great Locations, Good Locations, and Average Locations.
#### The other dataframe, 'los_angeles_none', contains all of the cities that have zero values for both 'Gyms' and 'Parks', so they are labeled as Bad Locations to open up a fitness center.
#### Next, I add all the cities in their respective categories as markers on the map. The legend in the top right-hand corner of the map can be used to view any combination of locations, from 'Great Locations' to 'Bad Locations'.

In [38]:
address = 'Los Angeles, US'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

feature_group = FeatureGroup(name='Great Locations')
feature_group2 = FeatureGroup(name='Good Locations')
feature_group3 = FeatureGroup(name='Average Locations')
feature_group4 = FeatureGroup(name='Bad Locations')

# create map of Los Angeles using latitude and longitude values
map_los_angeles = folium.Map(location=[latitude, longitude], zoom_start=10)

# rank counter used to split LA cities with data on gyms or parks into 3 groups
count = 0

# get various colored markers based on list position
def getMarker(lat, lng, label, given_color):
    label = folium.Popup(label, parse_html=True)
    marker = folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=given_color,
            fill=True,
            fill_color=given_color,
            fill_opacity=0.7,
            parse_html=False)
    return marker


# add markers to map from dataframe with values for Gyms and Parks
for lat, lng, label in zip(los_angeles_merged['lat'], los_angeles_merged['lng'], los_angeles_merged['City']):
    count = count + 1
    if (count < 15):
        getMarker(lat,lng,label,'green').add_to(feature_group)  
    elif (count < 30):
        getMarker(lat,lng,label,'blue').add_to(feature_group2)
    else:
        getMarker(lat,lng,label,'yellow').add_to(feature_group3)

# add markers to map from dataframe with no values for Gyms and Parks
for lat, lng, label in zip(los_angeles_none['lat'], los_angeles_none['lng'], los_angeles_none['City']):
    getMarker(lat,lng,label,'red').add_to(feature_group4)

# add each feature group to the map, then a layer control that serves as the legend
map_los_angeles.add_child(feature_group)
map_los_angeles.add_child(feature_group2)
map_los_angeles.add_child(feature_group3)
map_los_angeles.add_child(feature_group4)
map_los_angeles.add_child(folium.map.LayerControl())
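
The tier boundaries in the marker loop can be isolated in a small function, which makes the cut-offs easier to check. This is only a sketch: `tier_for_rank` is a hypothetical helper, and the thresholds are copied from the loop above.

```python
def tier_for_rank(rank):
    """Map a 1-based rank in the sorted table to a (tier name, marker color) pair."""
    if rank < 15:
        return ('Great Locations', 'green')
    elif rank < 30:
        return ('Good Locations', 'blue')
    return ('Average Locations', 'yellow')

# Made-up city labels, ranked in order; enumerate(start=1) matches the counter in the loop.
cities = ['city_{}'.format(i) for i in range(1, 41)]
tiers = [tier_for_rank(rank)[0] for rank, _ in enumerate(cities, start=1)]

print(tiers.count('Great Locations'),
      tiers.count('Good Locations'),
      tiers.count('Average Locations'))
```

With these thresholds, ranks 1 to 14 are 'Great', ranks 15 to 29 are 'Good', and everything from rank 30 on is 'Average'.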