# Battle of the neighborhoods: Tattoo Parlor Location

# Introduction/Background
The problem I am trying to solve involves a group of tattoo artists who became close friends at school and, with graduation approaching, want to open a tattoo parlor of their own in San Francisco, New York City (NYC) or Toronto. They chose these cities because each artist hails from one of them, so a move to any of the three would bring at least one of them home. They need help deciding where to go. The main factor in their decision is the potential for a new market, so they want to know whether each city has neighborhoods where they could set up shop.

Halfway through the project, two of the artists decide to get married and one of the spouses relocates from Toronto, which rules Toronto out. The choice therefore narrows to San Francisco and New York. With only two options left, the artists need another way to compare them, and they ask us to add crime incidence data as a further factor. As a result, the project initially covers Toronto, but the crime analysis is directed only at SF and NYC.

Data about Toronto was extracted from Wikipedia using BeautifulSoup.

In [1]:
import urllib.request

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
req = urllib.request.urlopen(url)
article = req.read().decode()

with open('List_of_postal_codes_of_Canada:_M.html', 'w') as fo:
    fo.write(article)

In [2]:
from bs4 import BeautifulSoup

article = open('List_of_postal_codes_of_Canada:_M.html').read()
soup = BeautifulSoup(article, 'html.parser')
tables = soup.find_all('table', class_='sortable')

for table in tables:
    ths = table.find_all('th')
    headings = [th.text.strip() for th in ths]
    if headings[:3] == ['Postcode', 'Borough', 'Neighbourhood']:
        break
        
with open('List_of_postal_codes_of_Canada:_M.csv', 'w') as fo:
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            continue
        Postcode, Borough, Neighbourhood = [td.text.strip() for td in tds[:3]]
        print(','.join([Postcode, Borough, Neighbourhood]), file=fo)
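
The row-extraction logic above can be tried offline on a small HTML fragment. The sketch below is a minimal illustration, assuming a hypothetical two-row table shaped like the Wikipedia one; `SAMPLE_HTML` and `extract_rows` are names introduced here, not part of the notebook. It also writes the rows with `csv.writer`, which quotes any field that contains a comma, whereas a plain `','.join` would split such a field into extra columns.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the Wikipedia table (illustration only).
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def extract_rows(html):
    """Return one [postcode, borough, neighbourhood] list per data row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for table in soup.find_all("table", class_="sortable"):
        headings = [th.text.strip() for th in table.find_all("th")]
        if headings[:3] != ["Postcode", "Borough", "Neighbourhood"]:
            continue  # not the table we are looking for
        for tr in table.find_all("tr"):
            tds = tr.find_all("td")
            if tds:  # the header row has <th> cells only
                rows.append([td.text.strip() for td in tds[:3]])
    return rows


rows = extract_rows(SAMPLE_HTML)

# Write to an in-memory buffer; csv.writer handles quoting for us.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```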

Attaching the headers to the list of postal codes

In [3]:
import pandas as pd

headers=["Postcode","Borough","Neighborhood"]

df=pd.read_csv('List_of_postal_codes_of_Canada:_M.csv',names=headers)

In [4]:
#df.head()

Missing data examination; then dropping the missing values

In [5]:
missing_data=df.isnull()
#for column in missing_data.columns.values.tolist():
   # print(column)
    #print(missing_data[column].value_counts())
   # print("")

In [6]:
df.dropna(subset=["Borough"], axis=0, inplace=True)

df.reset_index(drop=True, inplace=True)

Assign missing neighborhoods to name of corresponding borough, then concatenate by Postcode and reset the index

In [7]:
import numpy as np
# Where the neighborhood is missing, fall back to the borough name on the same row
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])

In [8]:
df1= df.groupby('Postcode').agg(lambda x: ','.join(x))
df2=df1.reset_index()
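# Collapse the repeated borough names produced by the join (raw strings and regex=True make the patterns explicit)
df2['Borough'] = df2['Borough'].str.replace(r'[{}\s]', '', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(r',{2,}', ',', regex=True)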
#df2.head()
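
The fill-then-combine step can be checked on a toy frame. This is a sketch under assumptions: the three rows are invented, and instead of joining and de-duplicating the borough strings it takes the first borough per postcode, which is a different (simpler) technique than the cell above uses.

```python
import pandas as pd

# Hypothetical rows: one postcode spread over two lines, one missing neighborhood.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Harbourfront", "Regent Park", None],
})

# Fall back to the borough name where the neighborhood is missing.
toy["Neighborhood"] = toy["Neighborhood"].fillna(toy["Borough"])

# One row per postcode: keep the first borough, join the neighborhoods.
merged = (
    toy.groupby("Postcode", as_index=False)
       .agg({"Borough": "first", "Neighborhood": ", ".join})
)
print(merged)
```

Taking the first borough is only safe if each postcode belongs to a single borough, which is worth verifying on the real data before relying on it.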

Import visualization and clustering libraries

In [9]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes
import folium
print('Libraries imported.')

Import more libraries for further examination of the data, especially the json libraries since the data from Foursquare will be generated as json files. 

In [10]:
import json
from pandas.io.json import json_normalize 
from IPython.display import Image 
from IPython.core.display import HTML 
import random

Foursquare credential definition in order to access the Foursquare API

In [11]:
# The code was removed by Watson Studio for sharing.

The first address requested is Toronto, Ontario. The geolocator is defined, and the latitude and longitude are read from the location it returns. The result is the pair of coordinates for Toronto.

In [12]:
address = 'Toronto, Ontario'
geolocator = Nominatim(user_agent="TOexplorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

43.653963 -79.387207
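
`Nominatim.geocode` returns `None` when it finds no match, in which case reading `.latitude` fails with an `AttributeError`. The sketch below wraps the lookup in a small helper; `lookup_coordinates` and `fake_geocode` are hypothetical names introduced here, and the stand-in geocoder exists only so the example runs without a network connection.

```python
from types import SimpleNamespace


def lookup_coordinates(geocode, address):
    """Return (latitude, longitude) for address, or raise if nothing matched.

    `geocode` is any callable shaped like Nominatim.geocode: it takes an
    address string and returns an object with .latitude and .longitude,
    or None when there is no result.
    """
    location = geocode(address)
    if location is None:
        raise ValueError("No geocoding result for {!r}".format(address))
    return location.latitude, location.longitude


def fake_geocode(address):
    """Offline stand-in for a real geocoder (illustration only)."""
    known = {
        "Toronto, Ontario": SimpleNamespace(latitude=43.653963, longitude=-79.387207),
    }
    return known.get(address)


print(lookup_coordinates(fake_geocode, "Toronto, Ontario"))
```

In the notebook the call would presumably look like `lookup_coordinates(geolocator.geocode, address)`, and the same helper could serve all three cities.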


A search query for Tattoo Parlors in Toronto is defined so that it can later be superimposed on a map. 

In [13]:
search_query = 'Tattoo Parlor'
#radius = 800
#url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT)
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, LIMIT)
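
The search query contains a space, so it can be safer to let the standard library encode the parameters rather than formatting them into the string by hand. A minimal sketch follows; `build_search_url` is a hypothetical helper, and the credentials, version string and limit passed to it are placeholders, not the notebook's real values.

```python
from urllib.parse import urlencode


def build_search_url(client_id, client_secret, lat, lng, version, query, limit):
    """Build a venues/search URL with every parameter percent-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "query": query,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)


# Placeholder values, for illustration only.
demo_url = build_search_url(
    "ID", "SECRET", 43.653963, -79.387207, "20180604", "Tattoo Parlor", 30
)
print(demo_url)
```

Because the three cities differ only in coordinates and query text, one helper like this would replace the three near-identical `format` calls.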

Requesting the URL returns a JSON response. The list of venues in that response is then flattened into a dataframe.

In [14]:
results =requests.get(url).json()
venues = results['response']['venues']
dframe_0=json_normalize(venues)

We then keep only the name, category, location and id columns. A helper function extracts the name of each venue's first category, and we apply it row by row to overwrite the 'categories' column. Finally, the 'location.' prefix is stripped from the column names.

In [16]:
filtered_columns = ['name', 'categories'] + [col for col in dframe_0.columns if col.startswith('location.')] + ['id']
dframe_0_filtered = dframe_0.loc[:, filtered_columns]

def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

dframe_0_filtered['categories'] = dframe_0_filtered.apply(get_category_type, axis=1)

dframe_0_filtered.columns = [column.split('.')[-1] for column in dframe_0_filtered.columns]

#dframe_0_filtered.head()
#dframe_0_filtered.shape
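
To see what the flattening and category extraction do, here is a sketch on a hypothetical two-venue payload shaped like `results['response']['venues']`. It assumes pandas 1.0 or later, where `json_normalize` is available as `pd.json_normalize`; the venue names and values are invented, and `first_category` is a simplified stand-in for `get_category_type`.

```python
import pandas as pd

# Hypothetical payload, for illustration only.
venues = [
    {"id": "a1", "name": "Ink Spot",
     "categories": [{"name": "Tattoo Parlor"}],
     "location": {"lat": 43.65, "lng": -79.38, "postalCode": "M5V"}},
    {"id": "b2", "name": "No Category Studio",
     "categories": [],
     "location": {"lat": 43.66, "lng": -79.40}},
]

# Nested keys become dotted column names such as 'location.lat'.
flat = pd.json_normalize(venues)


def first_category(categories):
    """Return the first category name, or None for an empty list."""
    return categories[0]["name"] if categories else None


flat["categories"] = flat["categories"].map(first_category)

# Keep only the last part of each dotted name ('location.lat' -> 'lat').
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat[["name", "categories", "lat", "lng", "postalCode"]])
```

The second venue has no postal code, so that cell comes out as missing. This is why columns such as 'neighborhood' or 'postalCode' may be sparse, or absent altogether, in real results.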

Then we clean up the dataframe so that only five columns remain.

In [17]:
dframe_0_filtered_clean = dframe_0_filtered[['name','address','distance','neighborhood','postalCode']]
#dframe_0_filtered_clean.head()

Use Folium to display a map showing tattoo parlours in Toronto

In [18]:
toronto_tattoo_map = folium.Map(location=[latitude, longitude], zoom_start=14)

for lat, lng, label in zip(dframe_0_filtered.lat, dframe_0_filtered.lng, dframe_0_filtered.categories):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='green',
        fill_opacity=0.6
    ).add_to(toronto_tattoo_map)

toronto_tattoo_map

Having mapped the tattoo parlors that the search returned for Toronto, we can see how they are distributed. We will come back to the distribution after we have generated maps of San Francisco and New York City.

We start off with New York. We have a dataset that was used in class, which we will use to generate the New York map. It has latitude and longitude data with neighborhood demarcations.

In [19]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [20]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

After the data has been opened as a json table, we extract the neighborhood using the features element from the data. 

In [22]:
neighborhoods_data = newyork_data['features']

Then we build a dataframe from the neighborhood data, with columns for the borough, neighborhood, latitude and longitude.

In [23]:
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect one record per feature, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0).
records = []
for data in neighborhoods_data:
    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_lon, neighborhood_lat = data['geometry']['coordinates'][:2]
    records.append({'Borough': data['properties']['borough'],
                    'Neighborhood': data['properties']['name'],
                    'Latitude': neighborhood_lat,
                    'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(records, columns=column_names)
#neighborhoods.head()

Then we call the Nominatim geolocator to get the coordinates of NYC.

In [24]:
address1 = 'New York City, NY'

geolocator1 = Nominatim(user_agent="ny_explorer")
location1 = geolocator1.geocode(address1)
latitude1 = location1.latitude
longitude1 = location1.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude1, longitude1))

The geographical coordinates of New York City are 40.7127281, -74.0060152.


As an optional task, we can generate a map of New York to display the neighborhoods with markers. 

In [25]:
map_newyork = folium.Map(location=[latitude1, longitude1], zoom_start=14)

for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=0.01,
        color='blue',
        popup=label,
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
#map_newyork

We define a search query for tattoo parlors in New York and load the returned venues into a dataframe. We then keep the name, category, location and id columns, apply a helper function that extracts each venue's first category name, and finally reduce the dataframe to the columns that hold the information we need.

In [26]:
search_query1 = 'Tattoo Parlour'
url1 = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude1, longitude1, VERSION, search_query1, LIMIT)
results1 = requests.get(url1).json()
venues1 = results1['response']['venues']
dataframe = json_normalize(venues1)

filtered_columns1 = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:,filtered_columns1]

def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

#Clean the filtered dataframe to only contain required fields. 
dataframe_filtered_clean = dataframe_filtered[['name','address','distance','postalCode','formattedAddress']]
#dataframe_filtered_clean.head()
#dataframe_filtered_clean.shape

Then using the data from the original filtered dataframe, we generate a map of NYC with the locations of Tattoo Parlors.

In [27]:
for lat, lng, label in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.categories):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='red',
        popup=label,
        fill = True,
        fill_color='orange',
        fill_opacity=0.6
    ).add_to(map_newyork)
map_newyork

Having generated the map of NYC, the next consideration is San Francisco. We will make calls to the geolocator to get the coordinates of San Francisco. 

In [28]:
address2= 'San Francisco, California'
geolocator2 = Nominatim(user_agent='SFexplorer')
location2=geolocator2.geocode(address2)
latitude2 = location2.latitude
longitude2 = location2.longitude
print('The geographical coordinates of San Francisco are {},{}'.format(latitude2,longitude2))

The geographical coordinates of San Francisco are 37.7792808,-122.4192363


Then we use the Foursquare API to access data on Tattoo Parlors and generate a dataframe that contains venues in San Francisco. 

In [29]:
search_query2='Tattoo parlor'
url2 = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude2, longitude2, VERSION, search_query2, LIMIT)
results2= requests.get(url2).json()
venues2=results2['response']['venues']
dataframe2=json_normalize(venues2)

Once we have the dataframe, we create a new dataframe that has filtered venues, then we attach a function to filter the venues by category. After that, we clean the dataframe with the columns that contain important information.

In [30]:
filtered_columns2=['name','categories']+[col for col in dataframe2.columns if col.startswith('location.')]+['id']
dataframe2_filtered=dataframe2.loc[:, filtered_columns2]

def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

dataframe2_filtered['categories'] = dataframe2_filtered.apply(get_category_type, axis=1)

dataframe2_filtered.columns = [column.split('.')[-1] for column in dataframe2_filtered.columns]

dataframe2_filtered_cleaned = dataframe2_filtered[['name','address','formattedAddress','city']]

#dataframe2_filtered_cleaned.head()

We then generate a map showing the locations and distribution of Tattoo Parlors in San Francisco.

In [31]:
sanfran_map = folium.Map(location=[latitude2,longitude2], zoom_start=14)

for lat,lng,label in zip(dataframe2_filtered.lat,dataframe2_filtered.lng,dataframe2_filtered.categories):
    folium.features.CircleMarker(
        [lat,lng],
        radius = 5,
        color = 'purple',
        popup = label,
        fill = True,
        fill_color = 'blue',
        fill_opacity = 0.7
    ).add_to(sanfran_map)
sanfran_map

We now have maps showing the distribution of tattoo parlors in all three cities, so we can turn to crime incidence in New York and San Francisco. Both datasets cover the year 2016 and, for ease of computation, are limited to their first 1000 incidents. The San Francisco dataset was sourced from https://cocl.us/sanfran_crime_dataset and the NYC dataset from https://opendata.cityofnewyork.us/. For variety in the visuals, the New York incidents are shown as marker clusters, while the San Francisco counts are shown on a choropleth map.

The San Francisco data is read into a pandas dataframe, which is reduced to the two pertinent columns, 'IncidntNum' and 'PdDistrict'. These are renamed 'Count' and 'Neighborhood' respectively. To obtain the counts, we group the incidents by the neighborhood in which they occurred and count the rows in each group.

In [35]:
filepath = "https://cocl.us/sanfran_crime_dataset"
df_crime= pd.read_csv(filepath)

limit0 = 1000
df_crime = df_crime.iloc[0:limit0, :]

df_crime=df_crime[['IncidntNum','PdDistrict']]
df_crime.rename(columns={'IncidntNum': 'Count','PdDistrict':'Neighborhood'},inplace=True)

df_crime_grouped=df_crime.groupby('Neighborhood',axis=0).count()
dfcg=df_crime_grouped.reset_index('Neighborhood')
#dfcg
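
The grouping step can be illustrated on a handful of invented incidents. The sketch below uses `groupby(...).size()`, which counts rows per group directly and so does not depend on renaming the incident number to 'Count' first. This is an alternative to the approach in the cell above, not a description of it.

```python
import pandas as pd

# Hypothetical incidents; only the district column matters for the tally.
incidents = pd.DataFrame({
    "IncidntNum": [101, 102, 103, 104, 105],
    "PdDistrict": ["MISSION", "SOUTHERN", "MISSION", "NORTHERN", "MISSION"],
})

counts = (
    incidents.groupby("PdDistrict")
             .size()                      # number of rows per district
             .reset_index(name="Count")
             .rename(columns={"PdDistrict": "Neighborhood"})
             .sort_values("Count", ascending=False)
             .reset_index(drop=True)
)
print(counts)
```

Note that a tally over the first 1000 rows reflects whatever order the file happens to be in, so the counts are a rough indication rather than a crime rate.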

Using the geo data from https://cocl.us/sanfran_geojson, we get a json table with geographical information about San Francisco. We use this data in our declaration of the San Francisco map, to produce a choropleth map, with the locations of tattoo parlors superimposed. 

In [36]:
!wget --quiet https://cocl.us/sanfran_geojson -O sanfran_geo.json

sanfran_geo=r'sanfran_geo.json'

sanfran_map.choropleth(
    geo_data=sanfran_geo,
    data=dfcg,
    columns = ['Neighborhood','Count'],
    key_on='feature.properties.DISTRICT',
    fill_color='BuGn',
    fill_opacity=0.5, 
    line_opacity=0.2,
    legend_name='Crime Incidence in San Francisco'
)
sanfran_map

The data for New York City was downloaded to my computer and then uploaded to the IDE using the recommended methodology for IBM Watson. 

In [37]:
# The code was removed by Watson Studio for sharing.

Once processing is done, a new pandas dataframe is made from the csv file.

In [38]:
df_newyork_crime = pd.read_csv(body)
#df_newyork_crime.head()

We define the dataframe to contain relevant columns only, and set our limit to 1000 for similarity with the SF dataset. 

In [40]:
df_nycrime=df_newyork_crime[['CMPLNT_NUM','OFNS_DESC','Latitude','Longitude','PATROL_BORO']]
limit = 1000
df_nycrime = df_nycrime.iloc[0:limit, :]

# Rename on a fresh copy so pandas does not warn about modifying a slice
df_nycrime_add = df_nycrime[['CMPLNT_NUM','PATROL_BORO']].rename(columns={'CMPLNT_NUM': 'Count','PATROL_BORO':'Neighborhood'})
df_nycrime_grouped=df_nycrime_add.groupby('Neighborhood',axis=0).count()
dfnycg=df_nycrime_grouped.reset_index('Neighborhood')
#dfnycg

We import Folium's plugins module and add a MarkerCluster layer to the New York map. Each incident is then added to the cluster as a marker whose popup shows the offense description.

In [41]:
from folium import plugins

newyork_incidents = plugins.MarkerCluster().add_to(map_newyork)

for lat, lng, label in zip(df_nycrime.Latitude, df_nycrime.Longitude, df_nycrime.OFNS_DESC):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(newyork_incidents)

map_newyork
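
A marker needs a numeric latitude and longitude, so any rows where either value is missing are best dropped before plotting. The sketch below shows that filtering step on invented complaints; the offense descriptions and coordinates are hypothetical, and the resulting dictionaries simply mirror the `location` and `popup` arguments passed to `folium.Marker` above.

```python
import pandas as pd

# Hypothetical complaints; the second row has no coordinates.
complaints = pd.DataFrame({
    "OFNS_DESC": ["PETIT LARCENY", "ROBBERY", "BURGLARY"],
    "Latitude": [40.71, None, 40.73],
    "Longitude": [-74.00, None, -73.99],
})

# Keep only rows that can actually be placed on the map.
plottable = complaints.dropna(subset=["Latitude", "Longitude"])

markers = [
    {"location": [row.Latitude, row.Longitude], "popup": row.OFNS_DESC}
    for row in plottable.itertuples(index=False)
]
print(len(markers), "of", len(complaints), "rows can be plotted")
```

If rows are dropped this way, it may be preferable to apply the 1000-row limit after filtering, so that both cities are compared on the same number of plotted incidents.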