# Similar district
## Coursera Capstone Project
### Sergii Guzenko

## Introduction/Business Problem
A family with a young child has decided to move from Turin, Italy, to Manhattan, NY, and is looking for the most suitable neighbourhood. They want to minimise the impact of the relocation on their daily life and work-life balance, so they asked me to compare their current home city with the new one and identify similar neighbourhoods.

We discussed and agreed on a list of criteria for the research, covering the presence of, and distance to:
- schools/kindergartens;
- parks/playgrounds for children;
- gyms/swimming pools;
- supermarkets/grocery shops;
- train and bus stations;
- airport;
- restaurants;
- landmarks.

In the future, this model can be used to find similar districts in another city or country in order to:
- suggest relocation options;
- suggest investment solutions;
- help solve urban problems.

## Data type and sources
I will use data from Foursquare to qualify and cluster neighbourhoods:
- venues by type;
- distance from the center of the neighborhood.

I will check other sources for crime rates, subway stations, etc. <br>
Here are some examples: <br>
Chicago crime: https://data.world/publicsafety/chicago-crime/file/chicago_crime_2014.csv or https://home.chicagopolice.org/statistics-data/public-arrest-data/ <br>
Subway stops: https://en.wikipedia.org/wiki/List_of_New_York_City_Subway_stations_in_Manhattan <br>
NYC Open Data for schools: https://data.cityofnewyork.us/Education/2017-2018-School-Locations/p6h4-mpyy

### Import Required Libraries

In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a pandas dataframe (the pandas.io.json import path is deprecated since pandas 1.0)

import requests # library to handle requests
import urllib.request
import time

#!conda install -c conda-forge beautifulsoup4 --y
#from bs4 import BeautifulSoup

#!conda install -c conda-forge lxml --y
#from lxml import etree

#from urllib.request import urlopen

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't installed geopy before
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#!conda install --override-channels -c main -c conda-forge folium=0.11.0 --yes
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't installed folium before
import folium # map rendering library
from folium import plugins

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import seaborn as sns

# import k-means from clustering stage
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


## Turin Map - Current residence and venues in neighborhood

For comparison with the future Manhattan neighborhood.

In [2]:
address = 'Corso Racconigi 28, Torino TO, Italy'
geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of the Italian home are {}, {}.'.format(latitude, longitude))

The geographical coordinates of the Italian home are 45.0717654, 7.6468397.


In [3]:
TU_neighborhood_latitude=latitude
TU_neighborhood_longitude=longitude

## Query Foursquare to find venues around the current residence in Turin

In [4]:
# The code was removed by Watson Studio for sharing.

In [5]:
LIMIT = 250 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    TU_neighborhood_latitude, 
    TU_neighborhood_longitude, 
    radius, 
    LIMIT)
#url # display URL
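
The URL above is assembled by hand with `str.format`. As an alternative sketch, the same query string can be built with `urllib.parse.urlencode`, which also takes care of escaping. The endpoint and parameter names are the ones used in the cell above; the credential and version values are placeholders (assumptions for illustration), and the variable names are hypothetical so they do not overwrite the real ones.

```python
from urllib.parse import urlencode

# Placeholder values: assumptions for illustration only, not real credentials.
placeholder_id = 'YOUR_CLIENT_ID'
placeholder_secret = 'YOUR_CLIENT_SECRET'
placeholder_version = 'YYYYMMDD'  # stands in for the API version date

# Coordinates of the Turin home, taken from the geocoding output above.
home_lat = 45.0717654
home_lng = 7.6468397

params = {
    'client_id': placeholder_id,
    'client_secret': placeholder_secret,
    'v': placeholder_version,
    'll': '{},{}'.format(home_lat, home_lng),
    'radius': 1000,  # metres, as in the cell above
    'limit': 250,    # maximum number of venues requested, as in the cell above
}

# urlencode joins the pairs with '&' and percent-encodes special characters.
explore_url = 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)
print(explore_url)
```

Note that the comma inside `ll` is percent-encoded as `%2C`, which is equivalent to a literal comma in a query string.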

In [6]:
# results display is hidden for report simplification 
results = requests.get(url).json()
#results

##### Function that extracts the category of the venue (borrowed from the Foursquare lab)

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [7]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
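
To make the behaviour of the function concrete, here is a small sketch that applies the same logic to two hand-made rows. The rows are invented for illustration and are plain dictionaries rather than dataframe rows; both raise `KeyError` for a missing key, so the fallback works the same way.

```python
def get_category_type(row):
    """Return the name of the first category of a venue, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        # Flattened Foursquare responses use the prefixed column name.
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Invented sample rows (not real Foursquare data).
row_with_category = {
    'venue.name': 'Example Pizzeria',
    'venue.categories': [{'name': 'Pizza Place'}, {'name': 'Restaurant'}],
}
row_without_category = {'venue.name': 'Unlabelled Venue', 'venue.categories': []}

first = get_category_type(row_with_category)
second = get_category_type(row_without_category)
print(first, second)
```

Only the first category is kept, so a venue tagged with several categories is counted once.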

Now we are ready to clean the json and structure it into a pandas dataframe.

In [8]:
venues = results['response']['groups'][0]['items']
TUnearby_venues = json_normalize(venues) # flatten JSON
# filter columns
filtered_columns = ['venue.location.neighborhood','venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
TUnearby_venues =TUnearby_venues.loc[:, filtered_columns]
# filter the category for each row
TUnearby_venues['venue.categories'] = TUnearby_venues.apply(get_category_type, axis=1)
# clean columns
TUnearby_venues.columns = [col.split(".")[-1] for col in TUnearby_venues.columns]

TUnearby_venues.shape

(52, 5)

Quickly examine the resulting dataframe.

In [9]:
# Venues near current Turin residence place
TUnearby_venues['neighborhood'] = 'Italian home'
TUnearby_venues.head(10)

Unnamed: 0,neighborhood,name,categories,lat,lng
0,Italian home,Osteria Antiche Sere,Piedmontese Restaurant,45.071046,7.643011
1,Italian home,Brasserie de La Mer,French Restaurant,45.071297,7.646836
2,Italian home,Vale un Perù,Peruvian Restaurant,45.07026,7.645671
3,Italian home,Bar Torrefazione Ferrucci,Coffee Shop,45.067947,7.655234
4,Italian home,Piola da Celso,Piedmontese Restaurant,45.066948,7.647337
5,Italian home,Parco della Tesoriera,Park,45.076597,7.638373
6,Italian home,Plin & Tajarin,Piedmontese Restaurant,45.073978,7.657748
7,Italian home,Hamburgeria,Burger Joint,45.065308,7.647515
8,Italian home,Wasabi,Japanese Restaurant,45.066104,7.655126
9,Italian home,Teatro Astra,Theater,45.07734,7.650184


In [10]:
#TUnearby_venues.groupby('categories').count()

#### Let's find out how many unique categories can be curated from all the returned venues

In [11]:
print('There are {} unique categories.'.format(len(TUnearby_venues['categories'].unique())))

There are 30 unique categories.


#### Let's analyze the neighborhood

In [12]:
# one hot encoding
TU_onehot = pd.get_dummies(TUnearby_venues[['categories']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
TU_onehot['neighborhood']=TUnearby_venues['neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [TU_onehot.columns[-1]] + list(TU_onehot.columns[:-1])
TU_onehot = TU_onehot[fixed_columns]


TU_onehot.head()

Unnamed: 0,neighborhood,Asian Restaurant,Burger Joint,Bus Station,Café,Chinese Restaurant,Cocktail Bar,Coffee Shop,Deli / Bodega,Food Truck,French Restaurant,Greek Restaurant,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Jewelry Store,Karaoke Bar,Kebab Restaurant,Market,Metro Station,Movie Theater,Park,Peruvian Restaurant,Piedmontese Restaurant,Pizza Place,Plaza,Pub,Restaurant,Sandwich Place,Sushi Restaurant,Theater
0,Italian home,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,Italian home,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Italian home,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
3,Italian home,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Italian home,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


#### Next, let's group rows by taking the mean of the frequency of occurrence of each category

In [13]:
TU_grouped = TU_onehot.groupby('neighborhood').mean().reset_index()
TU_grouped

Unnamed: 0,neighborhood,Asian Restaurant,Burger Joint,Bus Station,Café,Chinese Restaurant,Cocktail Bar,Coffee Shop,Deli / Bodega,Food Truck,French Restaurant,Greek Restaurant,Ice Cream Shop,Italian Restaurant,Japanese Restaurant,Jewelry Store,Karaoke Bar,Kebab Restaurant,Market,Metro Station,Movie Theater,Park,Peruvian Restaurant,Piedmontese Restaurant,Pizza Place,Plaza,Pub,Restaurant,Sandwich Place,Sushi Restaurant,Theater
0,Italian home,0.019231,0.038462,0.076923,0.038462,0.057692,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.057692,0.057692,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.057692,0.115385,0.076923,0.019231,0.019231,0.038462,0.019231,0.019231
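
As a minimal sketch of what the one-hot encoding and the grouped mean compute, the same two steps can be run on a toy dataframe. The data below is invented, and `dtype=int` is passed to `get_dummies` only so that the indicator columns are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Invented toy data: four venues in a single neighborhood.
toy = pd.DataFrame({
    'neighborhood': ['Toy home'] * 4,
    'categories': ['Pizza Place', 'Pizza Place', 'Park', 'Café'],
})

# One indicator column per category, one row per venue.
toy_onehot = pd.get_dummies(toy[['categories']], prefix="", prefix_sep="", dtype=int)
toy_onehot['neighborhood'] = toy['neighborhood']

# The mean of a 0/1 column is the share of venues in that category.
toy_grouped = toy_onehot.groupby('neighborhood').mean().reset_index()
print(toy_grouped)
```

Two of the four venues are pizza places, so the `Pizza Place` column has a mean of 0.5, and the shares in each row sum to 1.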


#### Let's print each neighborhood along with the top 5 most common venues

In [14]:
num_top_venues = 5

for hood in TU_grouped['neighborhood']:
    print("----"+hood+"----")
    temp = TU_grouped[TU_grouped['neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Italian home----
                venue  freq
0         Pizza Place  0.12
1               Plaza  0.08
2         Bus Station  0.08
3  Italian Restaurant  0.06
4  Chinese Restaurant  0.06




#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [15]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [16]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
TU_venues_sorted = pd.DataFrame(columns=columns)
TU_venues_sorted['Neighborhood'] = TU_grouped['neighborhood']

for ind in np.arange(TU_grouped.shape[0]):
    TU_venues_sorted.iloc[ind, 1:] = return_most_common_venues(TU_grouped.iloc[ind, :], num_top_venues)

TU_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Italian home,Pizza Place,Bus Station,Plaza,Chinese Restaurant,Japanese Restaurant,Piedmontese Restaurant,Italian Restaurant,Burger Joint,Café,Sandwich Place


### Map of Turin residence place with venues in Neighborhood - for reference

In [17]:
# create map of Turin place  using latitude and longitude values
map_tu = folium.Map(width=700, height=700, location=[TU_neighborhood_latitude, TU_neighborhood_longitude], zoom_start=15)
# add markers to map
for lat, lng, label in zip(TUnearby_venues['lat'], TUnearby_venues['lng'], TUnearby_venues['name']):
    label = folium.Popup(label, parse_html=True)
    folium.RegularPolygonMarker(
        [lat, lng],
        number_of_sides=30,
        radius=7,
        popup=label,
        color='blue',
        fill_color='blue',
        fill_opacity=0.8,
    ).add_to(map_tu)  
    
map_tu

## MANHATTAN NEIGHBORHOODS - DATA AND MAPPING

The clustered neighborhood data was produced with Foursquare during the course lab work and saved to a csv file covering the 40 neighborhoods of the Manhattan borough. Here the csv file is simply read back, for convenience and to keep the report compact.

In [18]:
# Read csv file with clustered neighborhoods with geodata
manhattan_data  = pd.read_csv('https://raw.githubusercontent.com/fint113/Coursera_Capstone/master/mh_neighboorhoods_data.csv') 
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels
0,Manhattan,Marble Hill,40.876551,-73.91066,2
1,Manhattan,Chinatown,40.715618,-73.994279,2
2,Manhattan,Washington Heights,40.851903,-73.9369,4
3,Manhattan,Inwood,40.867684,-73.92121,3
4,Manhattan,Hamilton Heights,40.823604,-73.949688,0



#### Manhattan Borough neighborhoods - data with top 10 clustered venues

In [20]:
manhattan_merged = pd.read_csv('https://raw.githubusercontent.com/fint113/Coursera_Capstone/master/manhattan_merged.csv')
manhattan_merged.shape

(40, 15)

## Map of Manhattan neighborhoods with top 10 clustered venues

#### Popups identify each neighborhood and its cluster of venues, so that each cluster can be examined in more detail in the following cells

In [21]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


In [22]:
MA_neighborhood_latitude=latitude
MA_neighborhood_longitude=longitude

kclusters=5
map_clusters = folium.Map(width=500, height=700, location=[MA_neighborhood_latitude, MA_neighborhood_longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_merged['Latitude'], manhattan_merged['Longitude'], manhattan_merged['Neighborhood'], manhattan_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=20,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
  # add markers for rental places to map
          
map_clusters

## Examine a particular cluster - print venues

#### Cluster 1

In [23]:
#manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 0, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

#### Cluster 2

In [24]:
#manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 1, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

#### Cluster 3

In [25]:
#manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 2, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

#### Cluster 4

In [26]:
#manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 3, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

#### Cluster 5

In [27]:
manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 4, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Washington Heights,Café,Bakery,Mobile Phone Shop,Pizza Place,Sandwich Place,Park,Gym,Latin American Restaurant,Tapas Restaurant,Mexican Restaurant
7,East Harlem,Mexican Restaurant,Bakery,Latin American Restaurant,Deli / Bodega,Thai Restaurant,French Restaurant,Café,Taco Place,Street Art,Steakhouse
11,Roosevelt Island,Coffee Shop,Sandwich Place,Park,Japanese Restaurant,Kosher Restaurant,Greek Restaurant,Baseball Field,Gym,Outdoors & Recreation,Dog Run
13,Lincoln Square,Theater,Gym / Fitness Center,Concert Hall,Plaza,Italian Restaurant,French Restaurant,Café,Opera House,Indie Movie Theater,Park
15,Midtown,Hotel,Theater,Coffee Shop,Steakhouse,Food Truck,Cocktail Bar,Clothing Store,Spa,Bookstore,Sporting Goods Shop
19,East Village,Ice Cream Shop,Bar,Wine Bar,Mexican Restaurant,Cocktail Bar,Pizza Place,Coffee Shop,Chinese Restaurant,Speakeasy,Vegetarian / Vegan Restaurant
20,Lower East Side,Chinese Restaurant,Coffee Shop,Café,Bakery,Latin American Restaurant,Park,Cocktail Bar,Japanese Restaurant,Pizza Place,Ramen Restaurant
21,Tribeca,American Restaurant,Italian Restaurant,Park,Spa,Café,Boutique,Wine Bar,Coffee Shop,Greek Restaurant,Gym
22,Little Italy,Bakery,Café,Yoga Studio,Cocktail Bar,Sandwich Place,Salon / Barbershop,Pizza Place,Ice Cream Shop,Seafood Restaurant,Chinese Restaurant
25,Manhattan Valley,Coffee Shop,Bar,Pizza Place,Chinese Restaurant,Indian Restaurant,Italian Restaurant,Thai Restaurant,Deli / Bodega,Mexican Restaurant,Yoga Studio


#### After examining the data for several clusters, I concluded that cluster #5 most closely resembles the Italian neighbourhood, which indicates where to look next.
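
The comparison above was made by inspecting the cluster tables. As a rough sketch of how such a comparison could be quantified, the category-frequency profile of the home neighbourhood can be scored against each candidate with cosine similarity. This is not the method used in this notebook; the function name `cosine_similarity` and all the frequencies below are invented for illustration.

```python
import math


def cosine_similarity(freq_a, freq_b):
    """Cosine similarity of two {category: frequency} profiles."""
    categories = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(c, 0.0) * freq_b.get(c, 0.0) for c in categories)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Invented profiles: shares of venue categories per area.
home = {'Pizza Place': 0.5, 'Park': 0.25, 'Café': 0.25}
candidates = {
    'Area A': {'Pizza Place': 0.4, 'Café': 0.4, 'Gym': 0.2},
    'Area B': {'Hotel': 0.6, 'Theater': 0.4},
}

scores = {name: cosine_similarity(home, freq) for name, freq in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

A score near 1 means the two areas have a similar mix of venue categories, while a score of 0 means they share no category at all.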

In [28]:
final_merged=manhattan_merged.loc[manhattan_merged['Cluster Labels'] == 4, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]]
final_merged

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Washington Heights,Café,Bakery,Mobile Phone Shop,Pizza Place,Sandwich Place,Park,Gym,Latin American Restaurant,Tapas Restaurant,Mexican Restaurant
7,East Harlem,Mexican Restaurant,Bakery,Latin American Restaurant,Deli / Bodega,Thai Restaurant,French Restaurant,Café,Taco Place,Street Art,Steakhouse
11,Roosevelt Island,Coffee Shop,Sandwich Place,Park,Japanese Restaurant,Kosher Restaurant,Greek Restaurant,Baseball Field,Gym,Outdoors & Recreation,Dog Run
13,Lincoln Square,Theater,Gym / Fitness Center,Concert Hall,Plaza,Italian Restaurant,French Restaurant,Café,Opera House,Indie Movie Theater,Park
15,Midtown,Hotel,Theater,Coffee Shop,Steakhouse,Food Truck,Cocktail Bar,Clothing Store,Spa,Bookstore,Sporting Goods Shop
19,East Village,Ice Cream Shop,Bar,Wine Bar,Mexican Restaurant,Cocktail Bar,Pizza Place,Coffee Shop,Chinese Restaurant,Speakeasy,Vegetarian / Vegan Restaurant
20,Lower East Side,Chinese Restaurant,Coffee Shop,Café,Bakery,Latin American Restaurant,Park,Cocktail Bar,Japanese Restaurant,Pizza Place,Ramen Restaurant
21,Tribeca,American Restaurant,Italian Restaurant,Park,Spa,Café,Boutique,Wine Bar,Coffee Shop,Greek Restaurant,Gym
22,Little Italy,Bakery,Café,Yoga Studio,Cocktail Bar,Sandwich Place,Salon / Barbershop,Pizza Place,Ice Cream Shop,Seafood Restaurant,Chinese Restaurant
25,Manhattan Valley,Coffee Shop,Bar,Pizza Place,Chinese Restaurant,Indian Restaurant,Italian Restaurant,Thai Restaurant,Deli / Bodega,Mexican Restaurant,Yoga Studio


In [29]:
# DataFrame.append was removed in pandas 2.0, so collect the matching rows and concatenate them
final_data = pd.concat([manhattan_data.loc[manhattan_data['Neighborhood'] == test, manhattan_data.columns[[1, 2, 3, 4]]] for test in final_merged['Neighborhood']])
final_data

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels
2,Washington Heights,40.851903,-73.9369,4
7,East Harlem,40.792249,-73.944182,4
11,Roosevelt Island,40.76216,-73.949168,4
13,Lincoln Square,40.773529,-73.985338,4
15,Midtown,40.754691,-73.981669,4
19,East Village,40.727847,-73.982226,4
20,Lower East Side,40.717807,-73.98089,4
21,Tribeca,40.721522,-74.010683,4
22,Little Italy,40.719324,-73.997305,4
25,Manhattan Valley,40.797307,-73.964286,4


# Map of Manhattan schools

#### Manhattan school locations (addresses) were obtained from the NYC Open Data site.

In [30]:
MA_schools_df = pd.read_csv('https://data.cityofnewyork.us/api/views/p6h4-mpyy/rows.csv')
MA_schools_df = MA_schools_df[['LOCATION_NAME','Location 1','NTA_NAME','LOCATION_CATEGORY_DESCRIPTION']]
MA_schools_df.columns = ['school_name','location','Neighborhood','school_type']
MA_schools_df.dropna(inplace = True)
MA_schools_df.head(5)

Unnamed: 0,school_name,location,Neighborhood,school_type
0,P.S. 015 Roberto Clemente,"333 EAST 4 STREET\nMANHATTAN, NY 10009\n(40.72...",Lower East Side ...,Elementary
1,P.S. 019 Asher Levy,"185 1 AVENUE\nMANHATTAN, NY 10003\n(40.730009,...",East Village ...,Elementary
2,P.S. 020 Anna Silver,"166 ESSEX STREET\nMANHATTAN, NY 10002\n(40.721...",Chinatown ...,Elementary
3,P.S. 034 Franklin D. Roosevelt,"730 EAST 12 STREET\nMANHATTAN, NY 10009\n(40.7...",Lower East Side ...,K-8
4,The STAR Academy - P.S.63,"121 EAST 3 STREET\nMANHATTAN, NY 10009\n(40.72...",East Village ...,Elementary


In [31]:
split1 = MA_schools_df['location'].str.split(r'\n()', expand=True)
MA_schools_df.drop(columns='location', inplace=True)
split1.head()

Unnamed: 0,0,1,2,3,4
0,333 EAST 4 STREET,,"MANHATTAN, NY 10009",,"(40.722075, -73.978747)"
1,185 1 AVENUE,,"MANHATTAN, NY 10003",,"(40.730009, -73.984496)"
2,166 ESSEX STREET,,"MANHATTAN, NY 10002",,"(40.721305, -73.986312)"
3,730 EAST 12 STREET,,"MANHATTAN, NY 10009",,"(40.726008, -73.975058)"
4,121 EAST 3 STREET,,"MANHATTAN, NY 10009",,"(40.72444, -73.986214)"


In [32]:
MA_schools_df[['address1','address2']] = split1[[0,2]]
MA_schools_df[['latitude','longitude']] = split1[4].str.split('[(,)]',expand=True)[[1,2]].astype('float64')
print(MA_schools_df.shape)
MA_schools_df.head()

(1822, 7)


Unnamed: 0,school_name,Neighborhood,school_type,address1,address2,latitude,longitude
0,P.S. 015 Roberto Clemente,Lower East Side ...,Elementary,333 EAST 4 STREET,"MANHATTAN, NY 10009",40.722075,-73.978747
1,P.S. 019 Asher Levy,East Village ...,Elementary,185 1 AVENUE,"MANHATTAN, NY 10003",40.730009,-73.984496
2,P.S. 020 Anna Silver,Chinatown ...,Elementary,166 ESSEX STREET,"MANHATTAN, NY 10002",40.721305,-73.986312
3,P.S. 034 Franklin D. Roosevelt,Lower East Side ...,K-8,730 EAST 12 STREET,"MANHATTAN, NY 10009",40.726008,-73.975058
4,The STAR Academy - P.S.63,East Village ...,Elementary,121 EAST 3 STREET,"MANHATTAN, NY 10009",40.72444,-73.986214
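
The cells above split the `location` text into address lines and coordinates. The sketch below shows the same idea on a single value, using plain string splitting and one regular expression. The sample string is reconstructed from the first row of the tables above, and the pattern is an assumption about the format rather than part of the original notebook.

```python
import re

# One location value, reconstructed from the first school in the tables above.
raw_location = "333 EAST 4 STREET\nMANHATTAN, NY 10009\n(40.722075, -73.978747)"

# The three parts are separated by newlines.
street, city_line, coords = raw_location.split("\n")

# Capture the two signed decimal numbers inside the parentheses.
coord_pattern = r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
match = re.fullmatch(coord_pattern, coords)
school_lat, school_lng = (float(group) for group in match.groups())

print(street, '|', city_line, '|', school_lat, school_lng)
```

Splitting on the newline alone avoids the empty columns that appear when the separator pattern contains a capture group.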


In [33]:
MA_schools_df = MA_schools_df[MA_schools_df.latitude != 0]
MA_schools_df['address2'] = MA_schools_df['address2'].map(lambda x: x.rstrip(' 0123456789'))
MA_schools_df = MA_schools_df[MA_schools_df.address2 == 'MANHATTAN, NY']
#MA_schools_df.sort_values(by=['longitude'], ascending=False).head()
print(MA_schools_df.shape)
MA_schools_df.head()
#MA_schools_df.groupby('address2').count()

(348, 7)


Unnamed: 0,school_name,Neighborhood,school_type,address1,address2,latitude,longitude
0,P.S. 015 Roberto Clemente,Lower East Side ...,Elementary,333 EAST 4 STREET,"MANHATTAN, NY",40.722075,-73.978747
1,P.S. 019 Asher Levy,East Village ...,Elementary,185 1 AVENUE,"MANHATTAN, NY",40.730009,-73.984496
2,P.S. 020 Anna Silver,Chinatown ...,Elementary,166 ESSEX STREET,"MANHATTAN, NY",40.721305,-73.986312
3,P.S. 034 Franklin D. Roosevelt,Lower East Side ...,K-8,730 EAST 12 STREET,"MANHATTAN, NY",40.726008,-73.975058
4,The STAR Academy - P.S.63,East Village ...,Elementary,121 EAST 3 STREET,"MANHATTAN, NY",40.72444,-73.986214


### Map of schools and neighbourhood clusters in Manhattan

In [34]:
from folium import plugins

map_schools = folium.Map(width=500, height=700, location=[MA_neighborhood_latitude, MA_neighborhood_longitude], zoom_start=12)

# instantiate a marker cluster object for the schools in the dataframe
schools = plugins.MarkerCluster().add_to(map_schools)

for lat, lng, label in zip(MA_schools_df['latitude'], MA_schools_df['longitude'], MA_schools_df['school_type']):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(schools)

markers_colors = []
for lat, lon, poi, cluster in zip(final_data['Latitude'], final_data['Longitude'], final_data['Neighborhood'], final_data['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=20,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7).add_to(map_schools)
    
map_schools

#### After examining the neighbourhoods in cluster #5, I concluded that 7 of 14 (*Lower East Side*, *East Village*, *Washington Heights*, *East Harlem*, *Manhattan Valley*, *Carnegie Hill*, *Lincoln Square*) offer a good choice of schools, which narrows down where to look next.

In [35]:
final_NB=['Lower East Side', 'East Village', 'Washington Heights', 'East Harlem', 'Manhattan Valley', 'Carnegie Hill', 'Lincoln Square']
final_NB

['Lower East Side',
 'East Village',
 'Washington Heights',
 'East Harlem',
 'Manhattan Valley',
 'Carnegie Hill',
 'Lincoln Square']

In [36]:
# DataFrame.append was removed in pandas 2.0, so collect the matching rows and concatenate them
final_merged = pd.concat([manhattan_merged.loc[manhattan_merged['Neighborhood'] == test, manhattan_merged.columns[[1] + list(range(5, manhattan_merged.shape[1]))]] for test in final_NB])
final_data = pd.concat([manhattan_data.loc[manhattan_data['Neighborhood'] == test, manhattan_data.columns[[1, 2, 3, 4]]] for test in final_NB])
    
final_merged

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,Lower East Side,Chinese Restaurant,Coffee Shop,Café,Bakery,Latin American Restaurant,Park,Cocktail Bar,Japanese Restaurant,Pizza Place,Ramen Restaurant
19,East Village,Ice Cream Shop,Bar,Wine Bar,Mexican Restaurant,Cocktail Bar,Pizza Place,Coffee Shop,Chinese Restaurant,Speakeasy,Vegetarian / Vegan Restaurant
2,Washington Heights,Café,Bakery,Mobile Phone Shop,Pizza Place,Sandwich Place,Park,Gym,Latin American Restaurant,Tapas Restaurant,Mexican Restaurant
7,East Harlem,Mexican Restaurant,Bakery,Latin American Restaurant,Deli / Bodega,Thai Restaurant,French Restaurant,Café,Taco Place,Street Art,Steakhouse
25,Manhattan Valley,Coffee Shop,Bar,Pizza Place,Chinese Restaurant,Indian Restaurant,Italian Restaurant,Thai Restaurant,Deli / Bodega,Mexican Restaurant,Yoga Studio
30,Carnegie Hill,Pizza Place,Coffee Shop,Cosmetics Shop,Café,Yoga Studio,Spa,Bar,Bookstore,French Restaurant,Gym
13,Lincoln Square,Theater,Gym / Fitness Center,Concert Hall,Plaza,Italian Restaurant,French Restaurant,Café,Opera House,Indie Movie Theater,Park


In [37]:
final_data

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels
20,Lower East Side,40.717807,-73.98089,4
19,East Village,40.727847,-73.982226,4
2,Washington Heights,40.851903,-73.9369,4
7,East Harlem,40.792249,-73.944182,4
25,Manhattan Valley,40.797307,-73.964286,4
30,Carnegie Hill,40.782683,-73.953256,4
13,Lincoln Square,40.773529,-73.985338,4


# Map of Manhattan showing the crimes and the cluster of venues

#### Manhattan crime locations (coordinates) were obtained from the NYC Open Data site.

In [38]:
MA_crime_source = pd.read_csv('https://data.cityofnewyork.us/api/views/5uac-w243/rows.csv')

In [39]:
MA_crime_df=MA_crime_source[['ADDR_PCT_CD','BORO_NM','LAW_CAT_CD','PD_DESC','Latitude','Longitude']]
MA_crime_df.columns=['Precinct','Borough','Category','Descript','Latitude','Longitude']
MA_crime_df=MA_crime_df.dropna(axis=0)
MA_crime_df = MA_crime_df[MA_crime_df.Latitude != 0]
print(MA_crime_df.shape)
MA_crime_df.head(10)

(107981, 6)


Unnamed: 0,Precinct,Borough,Category,Descript,Latitude,Longitude
0,75,BROOKLYN,MISDEMEANOR,"LARCENY,PETIT FROM AUTO",40.656991,-73.876574
1,77,BROOKLYN,FELONY,RAPE 1,40.674583,-73.930222
2,43,BRONX,MISDEMEANOR,"LARCENY,PETIT FROM STORE-SHOPL",40.830443,-73.871349
3,40,BRONX,MISDEMEANOR,"LARCENY,PETIT FROM STORE-SHOPL",40.817878,-73.916957
4,114,QUEENS,MISDEMEANOR,ASSAULT 3,40.752011,-73.935872
5,45,BRONX,VIOLATION,"HARASSMENT,SUBD 3,4,5",40.825907,-73.821328
6,42,BRONX,FELONY,WEAPONS POSSESSION 1 & 2,40.840977,-73.899175
7,71,BROOKLYN,FELONY,WEAPONS POSSESSION 3,40.663914,-73.950977
8,44,BRONX,VIOLATION,"HARASSMENT,SUBD 3,4,5",40.833059,-73.929802
9,47,BRONX,FELONY,UNAUTHORIZED USE VEHICLE 2,40.889138,-73.86202


In [40]:
MA_crime_df_rev = MA_crime_df[MA_crime_df.Borough != 'MANHATTAN']
MA_crime_df_rev = MA_crime_df_rev[MA_crime_df_rev.Precinct > 34]
print(MA_crime_df_rev.shape)
MA_crime_df_rev.groupby('Precinct').count().head()

(81065, 6)


Precinct,Borough,Category,Descript,Latitude,Longitude
40,2973,2973,2973,2973,2973
41,1333,1333,1333,1333,1333
42,1957,1957,1957,1957,1957
43,2414,2414,2414,2414,2414
44,2456,2456,2456,2456,2456


In [41]:
MA_crime_df = MA_crime_df[MA_crime_df.Borough == 'MANHATTAN']
MA_crime_df = MA_crime_df[MA_crime_df.Precinct < 35]
print(MA_crime_df.shape)
MA_crime_df.groupby('Precinct').count().tail()

(26883, 6)


Precinct,Borough,Category,Descript,Latitude,Longitude
28,1154,1154,1154,1154,1154
30,931,931,931,931,931
32,1357,1357,1357,1357,1357
33,938,938,938,938,938
34,1234,1234,1234,1234,1234


In [42]:
MA_crime_df_rev.groupby('Precinct').count().head(2)

Precinct,Borough,Category,Descript,Latitude,Longitude
40,2973,2973,2973,2973,2973
41,1333,1333,1333,1333,1333


#### Download boundaries of Police Precincts from site NYC Open Data

In [43]:
district_geo = r'https://data.cityofnewyork.us/api/geospatial/78dh-3ptz?method=export&format=GeoJSON'

In [44]:
r = requests.get(district_geo)
data = r.json()
# the result is directly a dictionary and if we examine the keys
data.keys()

dict_keys(['type', 'features'])

Check Precinct type

In [45]:
type(data['features'][0]['properties']['precinct'])

str

Convert it to integer

In [46]:
for feature in data['features']:
    feature['properties']['precinct']=int(feature['properties']['precinct'])
type(data['features'][0]['properties']['precinct'])

int

In [47]:
crimedata_rev = pd.DataFrame(MA_crime_df_rev['Precinct'].value_counts().astype(float))
crimedata_rev = crimedata_rev.reset_index()
crimedata_rev.columns = ['Precinct', 'Number']
crimedata_rev = crimedata_rev.sort_values(by=['Precinct']).astype(int)
crimedata_rev.head()

Unnamed: 0,Precinct,Number
1,40,2973
29,41,1333
7,42,1957
3,43,2414
2,44,2456


In [48]:
# keep only the precinct boundaries that are not in the non-Manhattan list
other_precincts = set(crimedata_rev['Precinct'])
data['features'] = [feature for feature in data['features'] if feature['properties']['precinct'] not in other_precincts]

In [49]:
#data['features']
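
The two steps applied to the precinct boundaries, converting the precinct id to an integer and then dropping the unwanted precincts, can be sketched on a toy GeoJSON-like dictionary. The structure mirrors the `type`/`features` keys shown above, while the features themselves and the name `precincts_to_drop` are invented for illustration.

```python
# Invented stand-in for the downloaded GeoJSON (geometries omitted).
toy_geo = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'precinct': '1'}, 'geometry': None},
        {'type': 'Feature', 'properties': {'precinct': '34'}, 'geometry': None},
        {'type': 'Feature', 'properties': {'precinct': '40'}, 'geometry': None},
    ],
}

# Step 1: precinct ids arrive as strings, so convert them to integers
# to match the integer precinct column of the crime dataframe.
for feature in toy_geo['features']:
    feature['properties']['precinct'] = int(feature['properties']['precinct'])

# Step 2: drop every feature whose precinct is in the exclusion set.
precincts_to_drop = {40, 41}
toy_geo['features'] = [
    feature for feature in toy_geo['features']
    if feature['properties']['precinct'] not in precincts_to_drop
]

remaining = [feature['properties']['precinct'] for feature in toy_geo['features']]
print(remaining)
```

Using a set for the exclusion list means the features are scanned once, instead of once per excluded precinct.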

Slice the data frame to get the total number of crimes per district

In [None]:
crimedata0 = pd.DataFrame(MA_crime_df['Precinct'].value_counts().astype(float))
crimedata0 = crimedata0.reset_index()
crimedata0.columns = ['Precinct', 'Number']
crimedata0 = crimedata0.sort_values(by=['Precinct']).astype(int)
crimedata0.head()

Unnamed: 0,Precinct,Number
5,1,1499
14,5,943
7,6,1384
13,7,984
10,9,1156


In [None]:
crimedata0.dtypes

Precinct    int64
Number      int64
dtype: object
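
The per-precinct totals above are built from `value_counts`. As a compact sketch of the same idea on invented data, `rename_axis` and `reset_index(name=...)` give the two columns their names in one chain, which should behave the same across pandas versions that name the `value_counts` result differently.

```python
import pandas as pd

# Invented incidents: one row per incident, labelled with its precinct.
toy_incidents = pd.DataFrame({'Precinct': [5, 1, 5, 7, 5, 1]})

toy_counts = (
    toy_incidents['Precinct']
    .value_counts()               # incidents per precinct
    .rename_axis('Precinct')      # name the index explicitly
    .reset_index(name='Number')   # turn it into a two-column dataframe
    .sort_values(by='Precinct')
    .reset_index(drop=True)
)
print(toy_counts)
```

The result has one row per precinct and integer counts, which is the shape the choropleth expects for its `columns` argument.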

### Map of registered crimes in Manhattan

In [None]:
map_crime = folium.Map(width=500, height=700, location=[MA_neighborhood_latitude, MA_neighborhood_longitude], zoom_start=12)

# folium.Choropleth replaces the deprecated Map.choropleth method
folium.Choropleth(
    geo_data=data,
    data=crimedata0,
    columns=['Precinct', 'Number'],
    key_on='feature.properties.precinct',
    fill_color='YlOrRd',
    fill_opacity=0.9,
    line_opacity=0.1,
    legend_name='Number of incidents per district'
).add_to(map_crime)

map_crime

### Map of registered crimes and neighbourhood clusters in Manhattan

In [None]:
markers_colors = []
for lat, lon, poi, cluster in zip(final_data['Latitude'], final_data['Longitude'], final_data['Neighborhood'], final_data['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=20,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7
    ).add_to(map_crime)
    
map_crime

#### After examining the 7 neighbourhoods in cluster #5, I concluded that 4 of them (*Washington Heights*, *Manhattan Valley*, *Carnegie Hill*, *Lincoln Square*) have a low registered crime rate, so the rest of the search focuses on these four.

In [None]:
final_NB.remove('Lower East Side')
final_NB.remove('East Village')
final_NB.remove('East Harlem')
final_NB

['Washington Heights', 'Manhattan Valley', 'Carnegie Hill', 'Lincoln Square']
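
Note that `list.remove` raises `ValueError` when the item is no longer in the list, so running the cell above a second time fails. A minimal sketch of a re-runnable alternative, assuming `final_NB` is a plain list of neighbourhood names. The starting list is reconstructed from this section's output, its order is illustrative, and the name `high_crime` is hypothetical.

```python
# The seven neighbourhoods of the cluster (order here is illustrative).
final_NB = [
    "Washington Heights", "Lower East Side", "Manhattan Valley",
    "East Village", "Carnegie Hill", "East Harlem", "Lincoln Square",
]

# Neighbourhoods excluded after inspecting the crime map.
high_crime = {"Lower East Side", "East Village", "East Harlem"}

# Filtering is idempotent: applying it twice gives the same list as once.
final_NB = [name for name in final_NB if name not in high_crime]
final_NB = [name for name in final_NB if name not in high_crime]
print(final_NB)
```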

In [None]:
final_merged=final_merged[final_merged['Neighborhood'].isin(final_NB)]
final_merged

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Washington Heights,Café,Bakery,Mobile Phone Shop,Pizza Place,Sandwich Place,Park,Gym,Latin American Restaurant,Tapas Restaurant,Mexican Restaurant
25,Manhattan Valley,Coffee Shop,Bar,Pizza Place,Chinese Restaurant,Indian Restaurant,Italian Restaurant,Thai Restaurant,Deli / Bodega,Mexican Restaurant,Yoga Studio
30,Carnegie Hill,Pizza Place,Coffee Shop,Cosmetics Shop,Café,Yoga Studio,Spa,Bar,Bookstore,French Restaurant,Gym
13,Lincoln Square,Theater,Gym / Fitness Center,Concert Hall,Plaza,Italian Restaurant,French Restaurant,Café,Opera House,Indie Movie Theater,Park


In [None]:
final_data=final_data[final_data['Neighborhood'].isin(final_NB)]
final_data

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels
2,Washington Heights,40.851903,-73.9369,4
25,Manhattan Valley,40.797307,-73.964286,4
30,Carnegie Hill,40.782683,-73.953256,4
13,Lincoln Square,40.773529,-73.985338,4


# Mapping Manhattan Subway locations

#### Subway station locations (coordinates) were downloaded as a CSV file from the NYC Open Data site.

In [None]:
MA_subway_df = pd.read_csv('https://data.cityofnewyork.us/api/views/kk4q-3rt2/rows.csv')
MA_subway_df = MA_subway_df[['NAME','the_geom']]
MA_subway_df.columns = ['Station','location']
MA_subway_df.head()

Unnamed: 0,Station,location
0,Astor Pl,POINT (-73.99106999861966 40.73005400028978)
1,Canal St,POINT (-74.00019299927328 40.71880300107709)
2,50th St,POINT (-73.98384899986625 40.76172799961419)
3,Bergen St,POINT (-73.97499915116808 40.68086213682956)
4,Pennsylvania Ave,POINT (-73.89488591154061 40.66471445143568)


In [None]:
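# Split 'POINT (lon lat)' on the parentheses; column 1 then holds 'lon lat'
split1 = MA_subway_df['location'].str.split(r'[()]', expand=True)
# The point stores longitude first, then latitude
split1[['longitude','latitude']]=split1[1].str.split(' ',expand=True)
split1.drop(columns=[1], inplace=True)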
split1.head()

Unnamed: 0,0,2,longitude,latitude
0,POINT,,-73.99106999861966,40.73005400028978
1,POINT,,-74.00019299927328,40.71880300107709
2,POINT,,-73.98384899986625,40.76172799961419
3,POINT,,-73.97499915116808,40.68086213682956
4,POINT,,-73.89488591154061,40.66471445143568
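
The `location` column holds WKT points, which list longitude before latitude. As an alternative to splitting on parentheses, a single regular expression can extract both numbers and reject malformed values. This is only a sketch, and `parse_wkt_point` is a hypothetical helper, not part of pandas or folium.

```python
import re

# Matches "POINT (<lon> <lat>)" with an optional sign and decimals.
_POINT_RE = re.compile(
    r"POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)"
)


def parse_wkt_point(text):
    """Return (latitude, longitude) as floats from a WKT POINT string."""
    match = _POINT_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a WKT point: {text!r}")
    lon, lat = map(float, match.groups())  # WKT order is x (lon), then y (lat)
    return lat, lon


lat, lon = parse_wkt_point("POINT (-73.99106999861966 40.73005400028978)")
print(lat, lon)
```

With pandas, the same pattern could be passed to `Series.str.extract` to obtain both columns in one step.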


In [None]:
#MA_subway_df[['address1','address2']] = split1[[0,2]]
MA_subway_df[['latitude','longitude']] = split1[['latitude','longitude']].astype('float64')
MA_subway_df.drop(columns='location', inplace=True)
print(MA_subway_df.shape)
MA_subway_df.head()

(473, 3)


Unnamed: 0,Station,latitude,longitude
0,Astor Pl,40.730054,-73.99107
1,Canal St,40.718803,-74.000193
2,50th St,40.761728,-73.983849
3,Bergen St,40.680862,-73.974999
4,Pennsylvania Ave,40.664714,-73.894886
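
Since distance to train stations is one of the family's criteria, the station coordinates can also be turned into a distance from each neighbourhood centre. The sketch below uses the haversine great-circle formula with a mean Earth radius of 6371 km. `haversine_km` and `nearest_station` are hypothetical helpers, and the coordinates are copied from the tables in this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(lat, lon, stations):
    """Return (name, distance_km) of the station closest to (lat, lon)."""
    return min(
        ((name, haversine_km(lat, lon, s_lat, s_lon))
         for name, s_lat, s_lon in stations),
        key=lambda pair: pair[1],
    )


# First five rows of MA_subway_df, as (Station, latitude, longitude).
stations = [
    ("Astor Pl", 40.730054, -73.991070),
    ("Canal St", 40.718803, -74.000193),
    ("50th St", 40.761728, -73.983849),
    ("Bergen St", 40.680862, -73.974999),
    ("Pennsylvania Ave", 40.664714, -73.894886),
]

# Centre of Lincoln Square, taken from final_data.
name, dist = nearest_station(40.773529, -73.985338, stations)
print(f"{name}: {dist:.2f} km")
```

Only five stations are used here, so the result is illustrative. Running the same function over the full station table would give the true nearest station.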


#### Visualize map with subway stations and neighborhood clusters

In [None]:
map_MA_subway = folium.Map(location=[MA_neighborhood_latitude, MA_neighborhood_longitude], zoom_start=12)

# add markers of subway locations to map
for lat, lng, label in zip(MA_subway_df['latitude'], MA_subway_df['longitude'],  MA_subway_df['Station'] ):
    label = folium.Popup(label, parse_html=True)
    folium.RegularPolygonMarker(
        [lat, lng],
        number_of_sides=6,
        radius=6,
        popup=label,
        color='red',
        fill_color='red',
        fill_opacity=1.0,  # opacity must lie between 0 and 1
    ).add_to(map_MA_subway) 

for lat, lon, poi, cluster in zip(final_data['Latitude'], final_data['Longitude'], final_data['Neighborhood'], final_data['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=20,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_MA_subway)    
    
map_MA_subway

Let's draw the complete map with neighbourhood clusters, schools, subway stations and registered crimes

In [None]:
map_MA_complete = folium.Map(location=[MA_neighborhood_latitude, MA_neighborhood_longitude], zoom_start=12)

# instantiate a mark cluster object for the incidents in the dataframe
crime = plugins.MarkerCluster().add_to(map_MA_complete)

for lat, lng, label in zip(MA_crime_df['Latitude'], MA_crime_df['Longitude'], MA_crime_df['Category']):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(crime)

for lat, lon, poi, cluster in zip(final_data['Latitude'], final_data['Longitude'], final_data['Neighborhood'], final_data['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=20,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7).add_to(map_MA_complete)
    
for lat, lng, label in zip(MA_subway_df['latitude'], MA_subway_df['longitude'],  MA_subway_df['Station'] ):
    label = folium.Popup(label, parse_html=True)
    folium.RegularPolygonMarker(
        [lat, lng],
        number_of_sides=6,
        radius=6,
        popup=label,
        color='red',
        fill_color='red',
        fill_opacity=1.0,  # opacity must lie between 0 and 1
    ).add_to(map_MA_complete) 

for lat, lng, label in zip(MA_schools_df['latitude'], MA_schools_df['longitude'], MA_schools_df['school_type']):
    label = folium.Popup(label, parse_html=True)
    folium.RegularPolygonMarker(
        [lat, lng],
        number_of_sides=4,
        radius=4,
        popup=label,
        color='green',
        fill_color='green',
    ).add_to(map_MA_complete)
    
map_MA_complete

This notebook is part of a course on **Coursera** called *Applied Data Science Capstone*